megatron.model.distributed.DistributedDataParallel#

class megatron.model.distributed.DistributedDataParallel(module, accumulate_allreduce_grads_in_fp32, use_contiguous_buffers)#

Bases: DistributedDataParallelBase

DDP with the option of contiguous buffers to store and accumulate gradients. This class:

  • has the potential to reduce memory fragmentation.

  • provides the option to accumulate gradients in a type other than the parameter type (for example, fp32)

Parameters:
  • module – input model.

  • accumulate_allreduce_grads_in_fp32 – if true, perform both gradient accumulation and gradient all-reduce in float32. If this option is true, use_contiguous_buffers must also be true.

  • use_contiguous_buffers – if true, use a contiguous buffer to store the gradients.
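A minimal sketch of wrapping a model with this class. The keyword arguments correspond to the parameters above; build_model is a hypothetical helper standing in for however the module is constructed in your setup.

```python
from megatron.model.distributed import DistributedDataParallel

model = build_model()  # hypothetical helper returning a torch.nn.Module

# fp32 accumulation/all-reduce requires contiguous buffers (see parameters above)
ddp_model = DistributedDataParallel(
    model,
    accumulate_allreduce_grads_in_fp32=True,
    use_contiguous_buffers=True,
)
```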

allreduce_gradients()#

Reduce gradients across data parallel ranks.

zero_grad_buffer()#

Set the grad buffer data to zero. Needs to be called at the beginning of each iteration.
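A minimal training-iteration sketch showing where these two methods fit, assuming ddp_model was created as above and that data_loader, loss_fn, and optimizer are defined elsewhere (they are illustrative placeholders, not part of this class).

```python
for batch in data_loader:
    ddp_model.zero_grad_buffer()        # zero the grad buffer at the start of the iteration
    loss = loss_fn(ddp_model(batch))    # forward pass
    loss.backward()                     # accumulate gradients into the (contiguous) buffer
    ddp_model.allreduce_gradients()     # reduce gradients across data parallel ranks
    optimizer.step()                    # apply the reduced gradients
```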