megatron.core.tensor_parallel.layers.ColumnParallelLinear#
- class megatron.core.tensor_parallel.layers.ColumnParallelLinear(input_size, output_size, *, bias=True, gather_output=True, init_method=<function xavier_normal_>, stride=1, keep_master_weight_for_test=False, skip_bias_add=False, async_tensor_model_parallel_allreduce=True, params_dtype=torch.float32, use_cpu_initialization=False, perform_initialization=True, gradient_accumulation_fusion=False, sequence_parallel_enabled: bool = False, world_size: int | None = None)#
Bases: Module
Linear layer with column parallelism.
The linear layer is defined as Y = XA + b. A is parallelized along its second dimension as A = [A_1, …, A_p].
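To make the partitioning concrete, here is a minimal single-process illustration in plain PyTorch (no distributed setup; the sizes are arbitrary) of how the column-partitioned products recombine into the full result:

```python
import torch

# Single-process illustration of Y = XA split as A = [A_1, A_2].
X = torch.randn(5, 4)            # [batch, input_size]
A = torch.randn(4, 8)            # [input_size, output_size]
A1, A2 = A.chunk(2, dim=1)       # partition A along its second dimension
Y1, Y2 = X @ A1, X @ A2          # each "rank" computes Y_i = X A_i
Y = torch.cat([Y1, Y2], dim=1)   # the all-gather concatenates the partitions
assert torch.allclose(Y, X @ A)  # matches the unpartitioned product
```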
- Parameters:
input_size – first dimension of matrix A.
output_size – second dimension of matrix A.
- Keyword Arguments:
bias: If true, add bias.
gather_output: If true, call all-gather on the output and make Y available to all GPUs; otherwise, each GPU holds only its own partition of the output, Y_i = XA_i.
init_method: Method used to initialize the weights. Note that the bias is always initialized to zero.
stride: For the strided linear layers.
keep_master_weight_for_test: This was added for testing and should be set to False. It returns the master weights used for initialization.
skip_bias_add: This was added to enable performance optimizations where the bias can be fused with other elementwise operations; we skip adding the bias and return it instead.
async_tensor_model_parallel_allreduce:
params_dtype:
use_cpu_initialization:
gradient_accumulation_fusion:
sequence_parallel_enabled:
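As a usage sketch (not part of the generated reference): the construction below assumes megatron.core is installed, the script is launched with torchrun on two CUDA devices, and the sizes and seed are arbitrary values chosen for illustration.

```python
import torch
from megatron.core import parallel_state, tensor_parallel
from megatron.core.tensor_parallel.layers import ColumnParallelLinear

# Assumes launch via e.g. `torchrun --nproc_per_node=2 this_script.py`.
torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(torch.distributed.get_rank())
parallel_state.initialize_model_parallel(tensor_model_parallel_size=2)
tensor_parallel.model_parallel_cuda_manual_seed(1234)  # seeds per-rank weight init

# Each rank allocates a 4096 / 2 = 2048 slice of the output dimension.
layer = ColumnParallelLinear(
    input_size=1024,
    output_size=4096,
    bias=True,
    gather_output=False,  # keep the per-rank partition Y_i = X A_i
)
```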
- forward(input_)#
- Parameters:
input_ – 3D tensor whose dimension order is [sequence, batch, hidden]
- Returns:
output – result of the linear transformation; gathered across tensor-parallel ranks when gather_output=True
bias – the bias term when skip_bias_add is True, otherwise None
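Continuing the construction sketch above, a hedged example of a forward pass; the shapes are arbitrary and follow the documented [sequence, batch, hidden] layout:

```python
# Input follows the documented [sequence, batch, hidden] layout.
seq_len, batch = 128, 8
input_ = torch.randn(seq_len, batch, 1024, device="cuda")

output, bias = layer(input_)
# With skip_bias_add=False the bias is already added and `bias` is None.
# With gather_output=False each rank sees only its partition of the output:
# output.shape == (128, 8, 4096 // parallel_state.get_tensor_model_parallel_world_size())
```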