megatron.optimizer.optimizer.MegatronOptimizer#
- class megatron.optimizer.optimizer.MegatronOptimizer(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, use_contiguous_buffers_in_local_ddp, models)#
Bases: ABC
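Concrete subclasses wrap a regular torch.optim optimizer and add the distributed bookkeeping documented below. The following sketch shows how a training loop typically drives such an optimizer during one iteration; the variable names and the exact step(...) signature are assumptions, so check the training loop of your Megatron-LM version.
```python
def train_iteration(optimizer, loss, args, timers):
    # Sketch of one iteration driving a concrete MegatronOptimizer subclass.
    # The step(args, timers) signature is an assumption for this sketch.
    scaled_loss = optimizer.scale_loss(loss)     # apply the current loss scale
    scaled_loss.backward()                       # produce (scaled) gradients
    optimizer.reduce_model_grads(args, timers)   # all-reduce grads and embedding grads
    optimizer.step(args, timers)                 # unscale, clip, and update parameters
    optimizer.gather_model_params(args, timers)  # no-op for non-distributed optimizers
```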
- allreduce_embedding_grads(args)#
All-reduce both word and position embedding grads.
- allreduce_layernorm_grads(args)#
All-reduce layernorm grads (for sequence parallelism).
- allreduce_position_embedding_grads(args)#
All-reduce the position_embeddings grad across the first (encoder) and split (decoder) stages to ensure that the position embedding parameters stay in sync. This should only run for T5 models with pipeline parallelism.
- allreduce_word_embedding_grads(args)#
All-reduce word embedding grads.
Reduce grads across the first and last pipeline stages to ensure that the word_embeddings parameters stay in sync. This should only run for models that support pipelined model parallelism (BERT and GPT-2).
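At its core this is a single all-reduce of the shared embedding's gradient over a process group that spans the first and last pipeline stages. A minimal sketch, assuming such a group (`embedding_group`) has already been built; the helper name is hypothetical:
```python
import torch.distributed as dist


def allreduce_shared_word_embedding_grad(word_embedding_weight, embedding_group):
    # Sum the gradient of the tied word embedding across the first and last
    # pipeline stages so both copies apply the same update.
    grad = word_embedding_weight.grad
    if grad is not None:
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=embedding_group)
```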
- gather_model_params(args, timers)#
For a non-distributed optimizer, there is nothing to do here.
- abstract get_loss_scale()#
The output should be a CUDA tensor of size 1.
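For example, an optimizer that does no loss scaling could satisfy this contract with a constant tensor; implementations with dynamic loss scaling return the scaler's current value instead. A sketch:
```python
import torch


class _NoScalingExample:
    # Sketch only: the simplest object satisfying the get_loss_scale contract.
    def __init__(self):
        # A CUDA tensor of size 1 holding the constant scale 1.0.
        self._scale = torch.ones(1, device="cuda", dtype=torch.float32)

    def get_loss_scale(self):
        return self._scale
```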
- get_model_parallel_group()#
The default model-parallel group is returned here, but the distributed optimizer overrides this method.
- reduce_model_grads(args, timers)#
All-reduce all grads, then all-reduce the embedding grads.
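A sketch of the documented order, assuming the local DDP wrapper exposes an allreduce_gradients() method and that the optimizer keeps its wrapped model chunks in `models`:
```python
def reduce_model_grads_sketch(optimizer, args):
    # Not the actual implementation: just the documented sequence of reductions.
    for model in optimizer.models:
        model.allreduce_gradients()             # data-parallel grad all-reduce
    optimizer.allreduce_layernorm_grads(args)   # sequence-parallel layernorm grads
    optimizer.allreduce_embedding_grads(args)   # tied word/position embedding grads
```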
- abstract reload_model_params()#
Refresh any internal state from the current model parameters. Call this whenever the parameters are changed outside of the optimizer. For example, when a model is loaded from a checkpoint without loading the optimizer, the model parameters are updated, but for an fp16 optimizer that keeps separate main parameters, the main parameters also need to be updated.
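For the fp16 case above, the refresh typically amounts to copying the (possibly just-loaded) model parameters back into the optimizer's main copies. A minimal sketch; the group names are assumptions for illustration:
```python
import torch


def reload_main_params_sketch(float16_groups, fp32_from_float16_groups):
    # Copy the current fp16 model parameters into the fp32 main copies so the
    # optimizer's state matches the freshly loaded model weights.
    with torch.no_grad():
        for model_group, main_group in zip(float16_groups, fp32_from_float16_groups):
            for model_param, main_param in zip(model_group, main_group):
                main_param.copy_(model_param)  # copy_ casts fp16 -> fp32
```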
- scale_loss(loss)#
Simple scaling.
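In concrete optimizers this is typically just a multiplication by the current loss scale; a minimal sketch of the method body:
```python
def scale_loss(self, loss):
    # Multiply by the current loss scale before backward() so that fp16
    # gradients stay representable; they are unscaled again inside step()
    # before the parameter update.
    return self.get_loss_scale() * loss
```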