megatron.model.distributed.DistributedDataParallel#

class megatron.model.distributed.DistributedDataParallel(module, accumulate_allreduce_grads_in_fp32, use_contiguous_buffers)#

Bases: DistributedDataParallelBase

DDP with the option of contiguous buffers to store and accumulate gradients. This class:

  • has the potential to reduce memory fragmentation.

  • provides the option to accumulate gradients in a type other than the parameter type (for example, fp32)

Parameters:
  • module – input model.

  • accumulate_allreduce_grads_in_fp32 – if true, perform both gradient accumulation and gradient all-reduce in float32. If this option is true, use_contiguous_buffers must also be true.

  • use_contiguous_buffers – if true, use a contiguous buffer to store the gradients.
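A minimal sketch of wrapping a model with this class. The keyword arguments correspond to the parameters above; build_model is a hypothetical helper standing in for however the module is constructed in your setup.

```python
from megatron.model.distributed import DistributedDataParallel

model = build_model()  # hypothetical helper returning a torch.nn.Module

# fp32 accumulation/all-reduce requires contiguous buffers (see parameters above)
ddp_model = DistributedDataParallel(
    model,
    accumulate_allreduce_grads_in_fp32=True,
    use_contiguous_buffers=True,
)
```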

allreduce_gradients()#

Reduce gradients across data parallel ranks.

zero_grad_buffer()#

Set the grad buffer data to zero. Needs to be called at the beginning of each iteration.
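A minimal training-iteration sketch showing where these two methods fit, assuming ddp_model was created as above and that data_loader, loss_fn, and optimizer are defined elsewhere (they are illustrative placeholders, not part of this class).

```python
for batch in data_loader:
    ddp_model.zero_grad_buffer()        # zero the grad buffer at the start of the iteration
    loss = loss_fn(ddp_model(batch))    # forward pass
    loss.backward()                     # accumulate gradients into the (contiguous) buffer
    ddp_model.allreduce_gradients()     # reduce gradients across data parallel ranks
    optimizer.step()                    # apply the reduced gradients
```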