megatron.core.tensor_parallel.layers.ColumnParallelLinear

class megatron.core.tensor_parallel.layers.ColumnParallelLinear(input_size, output_size, *, bias=True, gather_output=True, init_method=<function xavier_normal_>, stride=1, keep_master_weight_for_test=False, skip_bias_add=False, async_tensor_model_parallel_allreduce=True, params_dtype=torch.float32, use_cpu_initialization=False, perform_initialization=True, gradient_accumulation_fusion=False, sequence_parallel_enabled: bool = False, world_size: int | None = None)

Bases: Module

Linear layer with column parallelism.

The linear layer is defined as Y = XA + b. A is parallelized along its second dimension as A = [A_1, …, A_p].
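
As a sanity check of this definition, here is a minimal single-process sketch in plain PyTorch (no Megatron or distributed setup required; the partition count and sizes are illustrative) verifying that concatenating the per-partition products Y_i recovers Y:

    import torch

    p = 4                       # illustrative partition count (tensor-parallel world size)
    X = torch.randn(8, 16)      # [tokens, input_size]
    A = torch.randn(16, 32)     # [input_size, output_size]
    b = torch.randn(32)

    Y = X @ A + b               # unpartitioned reference result

    # Split A (and b) into p column blocks, one per "rank".
    A_parts = A.chunk(p, dim=1)
    b_parts = b.chunk(p, dim=0)
    Y_parts = [X @ A_i + b_i for A_i, b_i in zip(A_parts, b_parts)]

    # gather_output=True corresponds to concatenating the partial outputs.
    assert torch.allclose(torch.cat(Y_parts, dim=-1), Y, atol=1e-5)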

Parameters:
  • input_size – first dimension of matrix A.

  • output_size – second dimension of matrix A.

Keyword Arguments:
  • bias – If true, add bias.

  • gather_output – If true, call all-gather on the output and make Y available to all GPUs; otherwise, every GPU has only its own partition of the output, Y_i = XA_i.

  • init_method – Method to initialize weights. Note that bias is always set to zero.

  • stride – For the strided linear layers.

  • keep_master_weight_for_test – This was added for testing and should be set to False. It returns the master weights used for initialization.

  • skip_bias_add – This was added to enable performance optimizations where bias can be fused with other elementwise operations. We skip adding bias but instead return it.

  • async_tensor_model_parallel_allreduce – If true, overlap the all-reduce of the input gradient in the backward pass with the computation of the weight gradient.

  • params_dtype – Data type of the layer parameters (default: torch.float32).

  • use_cpu_initialization – If true, initialize the weights on the CPU rather than on the GPU.

  • gradient_accumulation_fusion – If true, fuse gradient accumulation into the weight-gradient GEMM (requires the fused CUDA kernel to be built).

  • sequence_parallel_enabled – If true, the input is assumed to be already split along the sequence dimension, and the layer performs the required sequence-parallel communication itself.
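
A hypothetical construction sketch follows. It assumes a distributed launch (e.g. torchrun) with one process per GPU and requires Megatron's model-parallel state to be initialized first; the tensor-parallel size and layer dimensions below are illustrative assumptions, not values prescribed by this API:

    import torch
    from megatron.core import parallel_state
    from megatron.core.tensor_parallel.layers import ColumnParallelLinear

    # Assumed environment: one process per GPU, launched with torchrun.
    torch.distributed.init_process_group(backend="nccl")
    parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)

    layer = ColumnParallelLinear(
        input_size=1024,
        output_size=4096,
        bias=True,
        gather_output=False,  # keep the local partition Y_i, e.g. to feed a row-parallel layer
        init_method=torch.nn.init.xavier_normal_,
        skip_bias_add=True,   # return the bias instead of adding it, enabling later fusion
    )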

forward(input_)
Parameters:
  • input_ – 3D tensor whose dimension order is [sequence, batch, hidden]

Returns:
  • output – the layer output Y (the full Y if gather_output is True; otherwise the local partition Y_i)

  • bias – the bias tensor, returned separately when skip_bias_add is True; None otherwise
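
Continuing the construction sketch above, the forward contract can be used as follows. With skip_bias_add=True the bias comes back unapplied so the caller can fuse it into a later elementwise operation; with skip_bias_add=False the output already includes the bias and the returned bias is None:

    x = torch.randn(128, 4, 1024, device="cuda")  # [sequence, batch, hidden]
    output, bias = layer(x)                       # output is the local Y_i here (gather_output=False)
    if bias is not None:
        output = output + bias                    # or fuse into a following activation/dropout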