megatron.optimizer.optimizer.MegatronOptimizer#
- class megatron.optimizer.optimizer.MegatronOptimizer(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, use_contiguous_buffers_in_local_ddp, models)#
Bases: ABC
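Concrete subclasses wrap a regular torch.optim optimizer and add the distributed bookkeeping documented below. The following sketch shows how a training loop typically drives such an optimizer during one iteration; the variable names and the exact step(...) signature are assumptions, so check the training loop of your Megatron-LM version.
```python
def train_iteration(optimizer, loss, args, timers):
    # Sketch of one iteration driving a concrete MegatronOptimizer subclass.
    # The step(args, timers) signature is an assumption for this sketch.
    scaled_loss = optimizer.scale_loss(loss)     # apply the current loss scale
    scaled_loss.backward()                       # produce (scaled) gradients
    optimizer.reduce_model_grads(args, timers)   # all-reduce grads and embedding grads
    optimizer.step(args, timers)                 # unscale, clip, and update parameters
    optimizer.gather_model_params(args, timers)  # no-op for non-distributed optimizers
```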
- allreduce_embedding_grads(args)#
All-reduce both word and position embedding grads.
- allreduce_layernorm_grads(args)#
All-reduce layernorm grads (for sequence parallelism).
- allreduce_position_embedding_grads(args)#
All-reduce the position_embeddings grad across the first (encoder) and split (decoder) stages to ensure that the position embedding parameters stay in sync. This should only run for T5 models with pipeline parallelism.
- allreduce_word_embedding_grads(args)#
All-reduce word embedding grads.
Reduce grads across the first and last pipeline stages to ensure that the word_embeddings parameters stay in sync. This should only run for models that support pipelined model parallelism (BERT and GPT-2).
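At its core this is a single all-reduce of the shared embedding's gradient over a process group that spans the first and last pipeline stages. A minimal sketch, assuming such a group (`embedding_group`) has already been built; the helper name is hypothetical:
```python
import torch.distributed as dist


def allreduce_shared_word_embedding_grad(word_embedding_weight, embedding_group):
    # Sum the gradient of the tied word embedding across the first and last
    # pipeline stages so both copies apply the same update.
    grad = word_embedding_weight.grad
    if grad is not None:
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=embedding_group)
```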
- gather_model_params(args, timers)#
For a non-distributed optimizer, there is nothing to do here.
- abstract get_loss_scale()#
The output should be a CUDA tensor of size 1.
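For example, an optimizer that does no loss scaling could satisfy this contract with a constant tensor; implementations with dynamic loss scaling return the scaler's current value instead. A sketch:
```python
import torch


class _NoScalingExample:
    # Sketch only: the simplest object satisfying the get_loss_scale contract.
    def __init__(self):
        # A CUDA tensor of size 1 holding the constant scale 1.0.
        self._scale = torch.ones(1, device="cuda", dtype=torch.float32)

    def get_loss_scale(self):
        return self._scale
```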
- get_model_parallel_group()#
The default model-parallel group is returned here, but the distributed optimizer overrides this method.
- reduce_model_grads(args, timers)#
All-reduce all grads, then all-reduce the embedding grads.
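A sketch of the documented order, assuming the local DDP wrapper exposes an allreduce_gradients() method and that the optimizer keeps its wrapped model chunks in `models`:
```python
def reduce_model_grads_sketch(optimizer, args):
    # Not the actual implementation: just the documented sequence of reductions.
    for model in optimizer.models:
        model.allreduce_gradients()             # data-parallel grad all-reduce
    optimizer.allreduce_layernorm_grads(args)   # sequence-parallel layernorm grads
    optimizer.allreduce_embedding_grads(args)   # tied word/position embedding grads
```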
- abstract reload_model_params()#
Refresh any internal state from the current model parameters. Call this whenever the parameters are changed outside of the optimizer. For example, when a model is loaded from a checkpoint without loading the optimizer, the model parameters are updated, but for an fp16 optimizer that keeps separate main parameters, the main parameters also need to be updated.
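For the fp16 case above, the refresh typically amounts to copying the (possibly just-loaded) model parameters back into the optimizer's main copies. A minimal sketch; the group names are assumptions for illustration:
```python
import torch


def reload_main_params_sketch(float16_groups, fp32_from_float16_groups):
    # Copy the current fp16 model parameters into the fp32 main copies so the
    # optimizer's state matches the freshly loaded model weights.
    with torch.no_grad():
        for model_group, main_group in zip(float16_groups, fp32_from_float16_groups):
            for model_param, main_param in zip(model_group, main_group):
                main_param.copy_(model_param)  # copy_ casts fp16 -> fp32
```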
- scale_loss(loss)#
Simple scaling.
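In concrete optimizers this is typically just a multiplication by the current loss scale; a minimal sketch of the method body:
```python
def scale_loss(self, loss):
    # Multiply by the current loss scale before backward() so that fp16
    # gradients stay representable; they are unscaled again inside step()
    # before the parameter update.
    return self.get_loss_scale() * loss
```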