megatron.schedules.custom_backward

megatron.schedules.custom_backward(output, grad_output)

Directly call the C++ autograd engine.

To make the deallocate_output_tensor optimization (documented above) work, the C++ autograd engine must be called directly, bypassing PyTorch's torch.autograd.backward. PyTorch's backward checks that the output and its gradient have the same shape, while the C++ engine's backward does not. Because deallocate_output_tensor replaces the output tensor's .data with a scalar placeholder to free activation memory, the output no longer matches the shape of grad_output, so the Python-level shape check would fail.
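The sketch below illustrates this pattern, following the approach used in Megatron-LM. Note that Variable._execution_engine.run_backward is a private PyTorch API; the exact keyword arguments shown here are an assumption and may differ across PyTorch versions.

```python
import torch
from torch.autograd.variable import Variable


def custom_backward(output, grad_output):
    """Run backward through the C++ engine, skipping the shape check.

    Assumes `output` has already been pseudo-freed by
    deallocate_output_tensor (its .data swapped for a scalar
    placeholder), so torch.autograd.backward would reject it.
    """
    assert isinstance(output, torch.Tensor)
    assert grad_output is None or isinstance(grad_output, torch.Tensor)

    # For a scalar output with no explicit gradient, supply the implicit
    # gradient of 1.0 that torch.autograd.backward would normally create.
    if grad_output is None:
        assert output.numel() == 1, "implicit grad requires scalar output."
        grad_output = torch.ones_like(output, memory_format=torch.preserve_format)

    # Invoke the C++ engine directly (see torch/csrc/autograd/python_engine.cpp).
    # Unlike torch.autograd.backward, it does not compare the shapes of
    # `output` and `grad_output` before running the backward pass.
    Variable._execution_engine.run_backward(
        tensors=(output,),
        grad_tensors=(grad_output,),
        keep_graph=False,
        create_graph=False,
        inputs=tuple(),
        allow_unreachable=True,
        accumulate_grad=True,
    )
```

Because this bypasses the public API, any behavior torch.autograd.backward normally provides beyond the raw engine call (such as argument validation) is skipped; callers are responsible for passing well-formed tensors.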