fastvideo.v1.layers.linear
#
Module Contents#
Classes#

ColumnParallelLinear | Linear layer with column parallelism.
LinearBase | Base linear layer.
LinearMethodBase | Base class for different (maybe quantized) linear methods.
MergedColumnParallelLinear | Packed linear layers with column parallelism.
QKVParallelLinear | Linear layers for the attention’s QKV transformation.
ReplicatedLinear | Replicated linear layer.
RowParallelLinear | Linear layer with row parallelism.
UnquantizedLinearMethod | Linear method without quantization.
Functions#

adjust_scalar_to_fused_array | For fused modules (QKV and MLP) we have an array of length N that holds one scale for each “logical” matrix. So the param is an array of length N. The loaded_weight corresponds to one of the shards on disk. Here, we slice the param based on the shard_id for loading.
Data#
API#
- class fastvideo.v1.layers.linear.ColumnParallelLinear(input_size: int, output_size: int, bias: bool = True, gather_output: bool = False, skip_bias_add: bool = False, params_dtype: Optional[torch.dtype] = None, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, output_sizes: Optional[list[int]] = None, prefix: str = '')[source]#
Bases:
fastvideo.v1.layers.linear.LinearBase
Linear layer with column parallelism.
The linear layer is defined as Y = XA + b. A is parallelized along its second dimension as A = [A_1, …, A_p].
- Parameters:
input_size – first dimension of matrix A.
output_size – second dimension of matrix A.
bias – If true, add bias.
gather_output – If true, call all-gather on output and make Y available to all GPUs, otherwise, every GPU will have its output which is Y_i = XA_i
skip_bias_add – This was added to enable performance optimizations where bias can be fused with other element-wise operations. We skip adding bias but instead return it.
params_dtype – Data type for the parameters.
quant_config – Quantization configuration.
output_sizes – list of output sizes packed into one output, like for QKV the list would be size 3.
prefix – The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj)
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input_: torch.Tensor) → tuple[torch.Tensor, Optional[torch.nn.parameter.Parameter]] [source]#
- weight_loader(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor) → None [source]#
- weight_loader_v2(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor) → None [source]#
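The column-parallel math described above can be sketched with NumPy standing in for the distributed torch tensors (a minimal illustration of the Y = XA + b column split, not the actual implementation): each rank holds one column block A_i of the weight, computes its partial output Y_i = XA_i, and gather_output corresponds to concatenating the partial outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # input batch
A = rng.standard_normal((8, 6))   # full weight: input_size x output_size
b = rng.standard_normal(6)

# Column parallelism splits A along its second (output) dimension:
# A = [A_1, ..., A_p], and each "rank" computes Y_i = X A_i.
p = 2
partials = [X @ A_i for A_i in np.split(A, p, axis=1)]

# gather_output=True corresponds to all-gathering and concatenating the Y_i.
Y_gathered = np.concatenate(partials, axis=1) + b

assert np.allclose(Y_gathered, X @ A + b)
```

With gather_output=False, each rank would simply keep its own slice of the output instead of concatenating.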
- class fastvideo.v1.layers.linear.LinearBase(input_size: int, output_size: int, skip_bias_add: bool = False, params_dtype: Optional[torch.dtype] = None, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, prefix: str = '')[source]#
Bases:
torch.nn.Module
Base linear layer.
- Parameters:
input_size – input dimension of the linear layer.
output_size – output dimension of the linear layer.
skip_bias_add – If true, skip adding bias but instead return it.
params_dtype – Data type for the parameters.
quant_config – Quantization configuration.
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- abstract forward(x: torch.Tensor) → tuple[torch.Tensor, Optional[torch.nn.parameter.Parameter]] [source]#
- class fastvideo.v1.layers.linear.LinearMethodBase[source]#
Bases:
fastvideo.v1.layers.quantization.base_config.QuantizeMethodBase
Base class for different (maybe quantized) linear methods.
- abstract apply(layer: torch.nn.Module, x: torch.Tensor, bias: Optional[torch.Tensor] = None) → torch.Tensor [source]#
Apply the weights in layer to the input tensor. Expects create_weights to have been called before on the layer.
- abstract create_weights(layer: torch.nn.Module, input_size_per_partition: int, output_partition_sizes: list[int], input_size: int, output_size: int, params_dtype: torch.dtype, **extra_weight_attrs) → None [source]#
Create weights for a linear layer. The weights will be set as attributes of the layer.
- Parameters:
layer – The layer that is using the LinearMethodBase factory.
input_size_per_partition – Size of the weight input dim on rank X.
output_partition_sizes – Sizes of the output dim of each logical weight on rank X. E.g., output_partition_sizes for QKVLinear is a list containing the widths of Wq, Wk, Wv on rank X.
input_size – Size of the input dim of the weight across all ranks.
output_size – Size of the output dim of the weight across all ranks.
params_dtype – Datatype of the parameters.
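The create_weights/apply contract can be sketched as follows. ToyUnquantizedMethod and ToyLayer are hypothetical stand-ins (real methods subclass LinearMethodBase and operate on torch modules and tensors; NumPy stands in here): create_weights attaches the parameter to the layer, and apply consumes it.

```python
import numpy as np

class ToyUnquantizedMethod:
    """Hypothetical minimal linear method following the two-step contract."""

    def create_weights(self, layer, input_size_per_partition,
                       output_partition_sizes, params_dtype=np.float32):
        # One weight covering all logical outputs, concatenated along dim 0.
        out = sum(output_partition_sizes)
        layer.weight = np.zeros((out, input_size_per_partition),
                                dtype=params_dtype)

    def apply(self, layer, x, bias=None):
        # Expects create_weights to have been called on the layer already.
        y = x @ layer.weight.T
        return y if bias is None else y + bias

class ToyLayer:
    pass

layer = ToyLayer()
method = ToyUnquantizedMethod()
method.create_weights(layer, input_size_per_partition=4,
                      output_partition_sizes=[2, 3])
y = method.apply(layer, np.ones((1, 4)))
assert y.shape == (1, 5)
```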
- class fastvideo.v1.layers.linear.MergedColumnParallelLinear(input_size: int, output_sizes: list[int], bias: bool = True, gather_output: bool = False, skip_bias_add: bool = False, params_dtype: Optional[torch.dtype] = None, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, prefix: str = '')[source]#
Bases:
fastvideo.v1.layers.linear.ColumnParallelLinear
Packed linear layers with column parallelism.
Similar to ColumnParallelLinear, but the weight matrix is concatenated along the output dimension. When the weight matrix is loaded, the different partitions are sharded separately.
- Parameters:
input_size – input dimension of the linear layer.
output_sizes – list of output dimensions of the linear layer.
bias – If true, add bias.
gather_output – If true, call all-gather on output and make the output available to all GPUs, otherwise, every GPU will have its own output.
skip_bias_add – This was added to enable performance optimizations where bias can be fused with other element-wise operations. We skip adding bias but instead return it.
params_dtype – Data type for the parameters.
quant_config – Quantization configuration.
prefix – The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj)
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- weight_loader(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor, loaded_shard_id: Optional[int] = None) → None [source]#
- weight_loader_v2(param: fastvideo.v1.models.parameter.BasevLLMParameter, loaded_weight: torch.Tensor, loaded_shard_id: Optional[int] = None) → None [source]#
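The merged-weight loading described above can be sketched like this. The names output_sizes, tp_size, tp_rank, and load_shard are hypothetical stand-ins for illustration: the layer's packed parameter holds the logical weights stacked along the output dim, and each checkpoint shard is narrowed to this rank's tensor-parallel rows and copied into its offset.

```python
import numpy as np

output_sizes = [6, 6]          # logical output dims before TP splitting
tp_size, tp_rank = 2, 0        # hypothetical tensor-parallel world
param = np.zeros((sum(output_sizes) // tp_size, 4))  # this rank's packed param

def load_shard(param, loaded_weight, shard_id):
    shard_size = output_sizes[shard_id] // tp_size
    shard_offset = sum(output_sizes[:shard_id]) // tp_size
    # Narrow the full checkpoint weight to this rank's rows...
    tp_rows = loaded_weight[tp_rank * shard_size:(tp_rank + 1) * shard_size]
    # ...and copy it into the packed parameter at the shard's offset.
    param[shard_offset:shard_offset + shard_size] = tp_rows

W_a = np.full((6, 4), 1.0)     # first logical weight from disk
W_b = np.full((6, 4), 2.0)     # second logical weight from disk
load_shard(param, W_a, 0)
load_shard(param, W_b, 1)
assert (param[:3] == 1.0).all() and (param[3:] == 2.0).all()
```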
- class fastvideo.v1.layers.linear.QKVParallelLinear(hidden_size: int, head_size: int, total_num_heads: int, total_num_kv_heads: Optional[int] = None, bias: bool = True, skip_bias_add: bool = False, params_dtype: Optional[torch.dtype] = None, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, prefix: str = '')[source]#
Bases:
fastvideo.v1.layers.linear.ColumnParallelLinear
Linear layers for the attention’s QKV transformation.
Linear layers for the linear transformation of the query, key, and value vectors in the attention layer. The weight matrix is concatenated along the output dimension. The layer is parallelized along the head dimension. When the number of key/value heads is smaller than the number of query heads (e.g., multi-query/grouped-query attention), the key/value head may be replicated while the query heads are partitioned.
- Parameters:
hidden_size – input hidden state size of the transformer.
head_size – size of each attention head.
total_num_heads – total number of attention query heads.
total_num_kv_heads – total number of attention key/value heads. If None, assume total_num_kv_heads = total_num_heads.
bias – If true, add bias.
skip_bias_add – This was added to enable performance optimizations where bias can be fused with other element-wise operations. We skip adding bias but instead return it.
params_dtype – Data type for the parameters.
quant_config – Quantization configuration.
prefix – The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj)
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- weight_loader(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor, loaded_shard_id: Optional[str] = None)[source]#
- weight_loader_v2(param: fastvideo.v1.models.parameter.BasevLLMParameter, loaded_weight: torch.Tensor, loaded_shard_id: Optional[str] = None)[source]#
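The head partitioning and replication described above can be sketched with a hypothetical helper (qkv_partition is not part of the API; it only illustrates the arithmetic): query heads are divided across tensor-parallel ranks, and when there are fewer key/value heads than ranks, each KV head is replicated across a group of ranks.

```python
def qkv_partition(total_num_heads, total_num_kv_heads, tp_size):
    """Per-rank head counts under tensor parallelism (illustrative sketch)."""
    num_heads = total_num_heads // tp_size          # query heads per rank
    if tp_size >= total_num_kv_heads:
        # Fewer KV heads than ranks: each rank keeps one replicated KV head.
        num_kv_heads = 1
        num_kv_head_replicas = tp_size // total_num_kv_heads
    else:
        # More KV heads than ranks: partition them like the query heads.
        num_kv_heads = total_num_kv_heads // tp_size
        num_kv_head_replicas = 1
    return num_heads, num_kv_heads, num_kv_head_replicas

# 32 query heads, 8 KV heads (grouped-query attention), 16-way TP:
# each rank gets 2 query heads and 1 KV head, replicated across 2 ranks.
assert qkv_partition(32, 8, 16) == (2, 1, 2)
```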
- class fastvideo.v1.layers.linear.ReplicatedLinear(input_size: int, output_size: int, bias: bool = True, skip_bias_add: bool = False, params_dtype: Optional[torch.dtype] = None, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, prefix: str = '')[source]#
Bases:
fastvideo.v1.layers.linear.LinearBase
Replicated linear layer.
- Parameters:
input_size – input dimension of the linear layer.
output_size – output dimension of the linear layer.
bias – If true, add bias.
skip_bias_add – If true, skip adding bias but instead return it.
params_dtype – Data type for the parameters.
quant_config – Quantization configuration.
prefix – The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj)
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: torch.Tensor) → tuple[torch.Tensor, Optional[torch.nn.parameter.Parameter]] [source]#
- weight_loader(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor) → None [source]#
- class fastvideo.v1.layers.linear.RowParallelLinear(input_size: int, output_size: int, bias: bool = True, input_is_parallel: bool = True, skip_bias_add: bool = False, params_dtype: Optional[torch.dtype] = None, reduce_results: bool = True, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, prefix: str = '')[source]#
Bases:
fastvideo.v1.layers.linear.LinearBase
Linear layer with row parallelism.
The linear layer is defined as Y = XA + b. A is parallelized along its first dimension and X along its second dimension, as A = [A_1; …; A_p] (row blocks stacked vertically) and X = [X_1, …, X_p], so that Y = X_1 A_1 + … + X_p A_p.
- Parameters:
input_size – first dimension of matrix A.
output_size – second dimension of matrix A.
bias – If true, add bias. Note that bias is not parallelized.
input_is_parallel – If true, we assume that the input is already split across the GPUs and we do not split again.
skip_bias_add – This was added to enable performance optimization where bias can be fused with other element-wise operations. We skip adding bias but instead return it.
params_dtype – Data type for the parameters.
quant_config – Quantization configuration.
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input_) → tuple[torch.Tensor, Optional[torch.nn.parameter.Parameter]] [source]#
- weight_loader(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor)[source]#
- weight_loader_v2(param: fastvideo.v1.models.parameter.BasevLLMParameter, loaded_weight: torch.Tensor)[source]#
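The row-parallel math described above can be sketched with NumPy standing in for the distributed torch tensors (an illustration of the split, not the actual implementation): each rank holds a row block A_i and the matching input slice X_i, computes a partial product, and reduce_results corresponds to summing the partials with an all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
A = rng.standard_normal((8, 6))
b = rng.standard_normal(6)

p = 2
A_shards = np.split(A, p, axis=0)   # A split along its first (input) dim
X_shards = np.split(X, p, axis=1)   # X split along its second dim

# Each "rank" computes a partial product; reduce_results=True corresponds
# to summing the partials (an all-reduce across ranks).
partials = [X_i @ A_i for X_i, A_i in zip(X_shards, A_shards)]
Y = sum(partials) + b               # bias is added once, after the reduction

assert np.allclose(Y, X @ A + b)
```

Note the contrast with ColumnParallelLinear: there the partials are concatenated; here they are summed, which is why the bias must not be parallelized.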
- class fastvideo.v1.layers.linear.UnquantizedLinearMethod[source]#
Bases:
fastvideo.v1.layers.linear.LinearMethodBase
Linear method without quantization.
- apply(layer: torch.nn.Module, x: torch.Tensor, bias: Optional[torch.Tensor] = None) → torch.Tensor [source]#
- create_weights(layer: torch.nn.Module, input_size_per_partition: int, output_partition_sizes: list[int], input_size: int, output_size: int, params_dtype: torch.dtype, **extra_weight_attrs) → None [source]#
- fastvideo.v1.layers.linear.WEIGHT_LOADER_V2_SUPPORTED[source]#
[‘CompressedTensorsLinearMethod’, ‘AWQMarlinLinearMethod’, ‘AWQLinearMethod’, ‘GPTQMarlinLinearMetho…
- fastvideo.v1.layers.linear.adjust_scalar_to_fused_array(param: torch.Tensor, loaded_weight: torch.Tensor, shard_id: Union[str, int]) → tuple[torch.Tensor, torch.Tensor] [source]#
For fused modules (QKV and MLP) we have an array of length N that holds 1 scale for each “logical” matrix. So the param is an array of length N. The loaded_weight corresponds to one of the shards on disk. Here, we slice the param based on the shard_id for loading.
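The slicing described above can be sketched as follows; the qkv_offsets mapping and the body are hypothetical illustrations of the idea (NumPy stands in for torch): the fused param holds one scale per logical matrix, and each loaded shard fills exactly one slot, indexed by shard_id.

```python
import numpy as np

# Hypothetical mapping from string shard ids to slots in the fused array.
qkv_offsets = {"q": 0, "k": 1, "v": 2}

def adjust_scalar_to_fused_array(param, loaded_weight, shard_id):
    idx = qkv_offsets[shard_id] if isinstance(shard_id, str) else shard_id
    # Slice out the one-element view for this shard, and make the
    # scalar weight one-dimensional so the shapes line up for copying.
    return param[idx:idx + 1], loaded_weight.reshape(1)

scales = np.zeros(3)                 # one scale per logical matrix (q, k, v)
dst, src = adjust_scalar_to_fused_array(scales, np.array(0.5), "k")
dst[:] = src                         # copy into the fused array in place
assert scales[1] == 0.5
```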