megatron.model.distributed.DistributedDataParallel#
- class megatron.model.distributed.DistributedDataParallel(module, accumulate_allreduce_grads_in_fp32, use_contiguous_buffers)#
Bases:
DistributedDataParallelBase
DDP with the option to use contiguous buffers to store and accumulate gradients. This class:
- has the potential to reduce memory fragmentation.
- provides the option to accumulate gradients in a type other than the parameter type (for example, fp32).
- Parameters:
module – input model.
accumulate_allreduce_grads_in_fp32 – if true, do the gradient accumulation and the gradient all-reduce in float32. If this option is true, use_contiguous_buffers must also be true.
use_contiguous_buffers – if true, use a contiguous buffer to store the gradients.
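A minimal construction sketch, assuming `model` is a list of `nn.Module` chunks produced elsewhere; the variable names and the alias `LocalDDP` are illustrative, not part of this API:

```python
from megatron.model.distributed import DistributedDataParallel as LocalDDP

# Wrap each model chunk in the local DDP class. Note that
# accumulate_allreduce_grads_in_fp32=True requires
# use_contiguous_buffers=True, as stated above.
model = [
    LocalDDP(
        model_module,
        accumulate_allreduce_grads_in_fp32=True,
        use_contiguous_buffers=True,
    )
    for model_module in model
]
```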
- allreduce_gradients()#
Reduce gradients across data parallel ranks.
- zero_grad_buffer()#
Set the grad buffer data to zero. Needs to be called at the beginning of each iteration.
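A sketch of how these two methods fit into a training iteration; the loop structure and the `forward_step`, `data_iterator`, and `optimizer` names are assumptions for illustration, not part of this class:

```python
for iteration in range(train_iters):
    # Reset the contiguous grad buffers before accumulating new gradients.
    for model_chunk in model:
        model_chunk.zero_grad_buffer()

    # Forward and backward pass (forward_step is a hypothetical helper).
    loss = forward_step(data_iterator, model)
    loss.backward()

    # All-reduce the accumulated gradients across data-parallel ranks
    # before the optimizer consumes them.
    for model_chunk in model:
        model_chunk.allreduce_gradients()

    optimizer.step()
```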