megatron.optimizer.clip_grads.clip_grad_norm_fp32#

megatron.optimizer.clip_grads.clip_grad_norm_fp32(parameters, grads_for_norm, max_norm, norm_type=2, model_parallel_group=None)#
Clips the gradient norm of an iterable of parameters whose gradients are in fp32.

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_, with added functionality to handle model-parallel parameters. Note that the gradients are modified in place.

Parameters:
  • parameters (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor that will have gradients normalized

  • grads_for_norm (Iterable[Tensor]) – an iterable of Tensors or a single Tensor that will be used for calculating the grad norm.

  • max_norm (float or int) – max norm of the gradients

  • norm_type (float or int) – type of the p-norm to use. Can be 'inf' for infinity norm.

  • model_parallel_group (group) – process group over which the gradient norm is reduced; given the nature of the distributed optimizer, this is passed as an argument rather than taken from global state.

Returns:

Total norm of the parameters (viewed as a single vector).
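
A minimal usage sketch, assuming the signature documented above and that torch.distributed (and Megatron's model parallelism, if any) has already been initialized; the helper name `clip_fp32_grads` and the way `grads_for_norm` is built here are illustrative placeholders, not part of the Megatron API.

```python
import torch
from megatron.optimizer.clip_grads import clip_grad_norm_fp32

def clip_fp32_grads(params, max_norm=1.0, mp_group=None):
    # Keep only parameters that actually have gradients.
    params_with_grad = [p for p in params if p.grad is not None]

    # In Megatron, grads_for_norm typically excludes gradients that are
    # duplicated across tensor-parallel ranks; for this sketch we simply
    # reuse all collected gradients.
    grads_for_norm = [p.grad for p in params_with_grad]

    # Clips gradients in place and returns the total (pre-clipping) norm.
    total_norm = clip_grad_norm_fp32(
        parameters=params_with_grad,
        grads_for_norm=grads_for_norm,
        max_norm=max_norm,
        norm_type=2,
        model_parallel_group=mp_group,
    )
    return total_norm
```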