fastvideo.v1.models.parameter
Module Contents
Classes
- BasevLLMParameter: Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader. Copies the loaded weight into the parameter when the provided weight loader is called.
- BlockQuantScaleParameter: Parameter class for weight scales loaded for weights with block-wise quantization. Uses both column and row parallelism.
- ChannelQuantScaleParameter: Parameter class for weight scales loaded for weights with channel-wise quantization. Equivalent to _ColumnvLLMParameter.
- GroupQuantScaleParameter: Parameter class for weight scales loaded for weights with grouped quantization. Uses both column and row parallelism.
- ModelWeightParameter: Parameter class for linear layer weights. Uses both column and row parallelism.
- PackedColumnParameter: Parameter for model parameters which are packed on disk and support column parallelism only. See PackedvLLMParameter for more details on the packed properties.
- PackedvLLMParameter: Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends ModelWeightParameter to take in the packed factor, the packed dimension, and, optionally, the Marlin tile size for Marlin kernels. Adjusts the shard_size and shard_offset used when loading fused linear layer weights to account for packing and, optionally, the Marlin tile size.
- PerTensorScaleParameter: Parameter class for scales where the number of scales equals the number of logical matrices in fused linear layers (e.g. for QKV, there are 3 scales loaded from disk). This is relevant to weights with per-tensor quantization. Adds functionality to map the scales to a shard during weight loading.
- RowvLLMParameter: Parameter class defining weight-loading functionality (load_row_parallel_weight) for parameters being loaded into linear layers with row parallel functionality. Requires an input_dim to be defined.
Functions
- permute_param_layout_: Permutes a parameter's layout to the specified input and output dimensions. Useful for forcing a parameter into a known layout. For example, to put a packed (quantized) weight matrix in the layout {input_dim = 0, output_dim = 1, packed_dim = 0}, call permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0). The call permutes x into that layout if required and asserts if the layout cannot be reached.
Data
API
- class fastvideo.v1.models.parameter.BasevLLMParameter(data: torch.Tensor, weight_loader: Callable)
Bases: torch.nn.Parameter
Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader. Copies the loaded weight into the parameter when the provided weight loader is called.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
- load_column_parallel_weight(loaded_weight: torch.Tensor) -> None
- load_merged_column_weight(loaded_weight: torch.Tensor, **kwargs) -> None
- load_qkv_weight(loaded_weight: torch.Tensor, **kwargs) -> None
- load_row_parallel_weight(loaded_weight: torch.Tensor) -> None
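The copy-on-load behaviour described for BasevLLMParameter can be pictured with a toy stand-in. This is a minimal sketch in plain Python, not the library's implementation: SketchParameter, _copy_checked and default_loader are hypothetical names, and a list stands in for a torch.Tensor.

```python
from typing import Callable, List


class SketchParameter:
    """Toy stand-in for a parameter that carries its own weight loader."""

    def __init__(self, data: List[float], weight_loader: Callable) -> None:
        self.data = data
        self.weight_loader = weight_loader

    def _copy_checked(self, loaded_weight: List[float]) -> None:
        # Hypothetical helper: sizes must match before the in-place copy.
        assert len(self.data) == len(loaded_weight), "shape mismatch"
        self.data[:] = loaded_weight

    def load_column_parallel_weight(self, loaded_weight: List[float]) -> None:
        self._copy_checked(loaded_weight)


def default_loader(param: SketchParameter, loaded_weight: List[float]) -> None:
    # Hypothetical loader: a layer would decide here which load_* method to call.
    param.load_column_parallel_weight(loaded_weight)


param = SketchParameter([0.0, 0.0, 0.0], weight_loader=default_loader)
param.weight_loader(param, [1.0, 2.0, 3.0])
print(param.data)
```

The point of the pattern is that the parameter object, rather than the layer, owns the logic for copying checkpoint data into place.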
- class fastvideo.v1.models.parameter.BlockQuantScaleParameter(output_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter, fastvideo.v1.models.parameter.RowvLLMParameter
Parameter class for weight scales loaded for weights with block-wise quantization. Uses both column and row parallelism.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
- class fastvideo.v1.models.parameter.ChannelQuantScaleParameter(output_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter
Parameter class for weight scales loaded for weights with channel-wise quantization. Equivalent to _ColumnvLLMParameter.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
- class fastvideo.v1.models.parameter.GroupQuantScaleParameter(output_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter, fastvideo.v1.models.parameter.RowvLLMParameter
Parameter class for weight scales loaded for weights with grouped quantization. Uses both column and row parallelism.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
- class fastvideo.v1.models.parameter.ModelWeightParameter(output_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter, fastvideo.v1.models.parameter.RowvLLMParameter
Parameter class for linear layer weights. Uses both column and row parallelism.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
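To make "column and row parallelism" concrete, the sketch below slices a small matrix the way a tensor-parallel rank might. It assumes the weight is stored as [output, input] (output_dim=0, input_dim=1), that column parallelism shards along output_dim, and that row parallelism shards along input_dim. The helper shard_along_dim is hypothetical and uses nested lists instead of tensors.

```python
from typing import List

Matrix = List[List[int]]


def shard_along_dim(weight: Matrix, dim: int, tp_rank: int, tp_size: int) -> Matrix:
    """Return the slice of `weight` owned by `tp_rank` along `dim` (0 = rows, 1 = columns)."""
    full = len(weight) if dim == 0 else len(weight[0])
    assert full % tp_size == 0, "dimension must divide evenly across ranks"
    shard_size = full // tp_size
    start = tp_rank * shard_size
    if dim == 0:
        return [row[:] for row in weight[start:start + shard_size]]
    return [row[start:start + shard_size] for row in weight]


# A 4x4 weight stored as [output, input] (an assumption of this sketch).
weight = [[r * 4 + c for c in range(4)] for r in range(4)]

# Column parallelism: rank 1 of 2 keeps the second half of the output dimension.
column_shard = shard_along_dim(weight, dim=0, tp_rank=1, tp_size=2)

# Row parallelism: rank 1 of 2 keeps the second half of the input dimension.
row_shard = shard_along_dim(weight, dim=1, tp_rank=1, tp_size=2)
```

A class that supports both modes needs to know both dimensions, which is consistent with ModelWeightParameter taking output_dim and inheriting the input_dim requirement from RowvLLMParameter.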
- class fastvideo.v1.models.parameter.PackedColumnParameter(packed_factor: Union[int, fractions.Fraction], packed_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter
Parameter for model parameters which are packed on disk and support column parallelism only. See PackedvLLMParameter for more details on the packed properties.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
- class fastvideo.v1.models.parameter.PackedvLLMParameter(packed_factor: Union[int, fractions.Fraction], packed_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter.ModelWeightParameter
Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends ModelWeightParameter to take in the packed factor, the packed dimension, and, optionally, the Marlin tile size for Marlin kernels. Adjusts the shard_size and shard_offset used when loading fused linear layer weights to account for packing and, optionally, the Marlin tile size.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
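The effect of packing on shard indexes can be illustrated with a small sketch. It assumes eight unsigned 4-bit values per 32-bit word, stored lowest nibble first, and it ignores the Marlin tile size entirely. Both helpers, pack_int4 and adjust_shard_for_packing, are hypothetical and are not the library's functions.

```python
from typing import List, Tuple


def pack_int4(values: List[int]) -> List[int]:
    """Pack unsigned 4-bit values into 32-bit words, eight per word, lowest nibble first."""
    assert len(values) % 8 == 0, "need a multiple of eight values"
    packed = []
    for i in range(0, len(values), 8):
        word = 0
        for j, value in enumerate(values[i:i + 8]):
            assert 0 <= value < 16, "value does not fit in 4 bits"
            word |= value << (4 * j)
        packed.append(word)
    return packed


def adjust_shard_for_packing(shard_size: int, shard_offset: int,
                             packed_factor: int) -> Tuple[int, int]:
    """Convert logical (unpacked) shard indexes into indexes along the packed dimension."""
    return shard_size // packed_factor, shard_offset // packed_factor


packed = pack_int4([1, 2, 3, 4, 5, 6, 7, 8])

# A logical shard of 4096 elements starting at element 8192 covers
# 512 packed words starting at word 1024 when packed_factor is 8.
size, offset = adjust_shard_for_packing(4096, 8192, packed_factor=8)
```

Because one stored element holds several logical values, any shard size or offset computed in logical units has to be scaled down along the packed dimension before slicing the on-disk tensor.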
- class fastvideo.v1.models.parameter.PerTensorScaleParameter(**kwargs)
Bases: fastvideo.v1.models.parameter.BasevLLMParameter
Parameter class for scales where the number of scales equals the number of logical matrices in fused linear layers (e.g. for QKV, there are 3 scales loaded from disk). This is relevant to weights with per-tensor quantization. Adds functionality to map the scales to a shard during weight loading.
Note: additional parameter manipulation may be handled for each quantization config specifically, within process_weights_after_loading.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
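As a rough illustration of mapping scales to shards, the sketch below keeps one scale per logical matrix of a fused QKV layer and writes each loaded scale into its slot. The mapping QKV_INDEX and the helpers shard_id_to_index and load_scale are assumptions made for this example, not names taken from the library.

```python
from typing import List, Union

# Hypothetical mapping from a fused-layer shard name to its slot.
QKV_INDEX = {"q": 0, "k": 1, "v": 2}


def shard_id_to_index(shard_id: Union[str, int]) -> int:
    """Accept either an integer slot or a "q"/"k"/"v" name."""
    if isinstance(shard_id, int):
        return shard_id
    assert shard_id in QKV_INDEX, f"unknown shard id: {shard_id!r}"
    return QKV_INDEX[shard_id]


def load_scale(scales: List[float], loaded_scale: float,
               shard_id: Union[str, int]) -> None:
    """Store one per-tensor scale in the slot that belongs to its shard."""
    scales[shard_id_to_index(shard_id)] = loaded_scale


# Three logical matrices (Q, K, V) means three scales loaded from disk.
scales = [0.0, 0.0, 0.0]
load_scale(scales, 0.5, "q")
load_scale(scales, 0.25, "k")
load_scale(scales, 0.125, 2)  # an integer shard id addresses the slot directly
```

Any later reduction of the three scales to a single value would be a separate step, in line with the note about process_weights_after_loading.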
- class fastvideo.v1.models.parameter.RowvLLMParameter(input_dim: int, **kwargs)
Bases: fastvideo.v1.models.parameter.BasevLLMParameter
Parameter class defining weight-loading functionality (load_row_parallel_weight) for parameters being loaded into linear layers with row parallel functionality. Requires an input_dim to be defined.
Initialization
Initialize the BasevLLMParameter.
- Parameters:
data - torch tensor with the parameter data
weight_loader - weight loader callable
- Returns:
a torch.nn.Parameter
- load_row_parallel_weight(loaded_weight: torch.Tensor) -> None
- fastvideo.v1.models.parameter.permute_param_layout_(param: fastvideo.v1.models.parameter.BasevLLMParameter, input_dim: int, output_dim: int, **kwargs) -> fastvideo.v1.models.parameter.BasevLLMParameter
Permutes a parameter's layout to the specified input and output dimensions. Useful for forcing a parameter into a known layout. For example, to put a packed (quantized) weight matrix in the layout {input_dim = 0, output_dim = 1, packed_dim = 0}, call permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0). The call permutes x into that layout if required and asserts if the layout cannot be reached.
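The idea behind a layout permutation can be sketched without tensors: given where the input and output dimensions currently sit and where they should end up, compute the axis order to apply. The helpers layout_permutation and permute_shape are hypothetical, and the sketch does not model packed_dim or the library's assertions.

```python
from typing import List, Optional, Tuple


def layout_permutation(ndim: int, curr_input_dim: int, curr_output_dim: int,
                       input_dim: int, output_dim: int) -> List[int]:
    """Return perm such that new axis i is taken from old axis perm[i]."""
    assert curr_input_dim != curr_output_dim and input_dim != output_dim
    perm: List[Optional[int]] = [None] * ndim
    perm[input_dim] = curr_input_dim
    perm[output_dim] = curr_output_dim
    # Remaining axes keep their relative order in the free slots.
    remaining = iter(d for d in range(ndim)
                     if d not in (curr_input_dim, curr_output_dim))
    return [next(remaining) if p is None else p for p in perm]


def permute_shape(shape: Tuple[int, ...], perm: List[int]) -> Tuple[int, ...]:
    """Apply an axis permutation to a shape tuple."""
    return tuple(shape[d] for d in perm)


# A weight stored as [output=128, input=64] that must become [input, output].
perm = layout_permutation(2, curr_input_dim=1, curr_output_dim=0,
                          input_dim=0, output_dim=1)
new_shape = permute_shape((128, 64), perm)

# A parameter already in the requested layout needs no change.
identity = layout_permutation(2, curr_input_dim=0, curr_output_dim=1,
                              input_dim=0, output_dim=1)
```

With torch tensors, a permutation of this form is what one would pass to Tensor.permute.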