fastvideo.v1.models.parameter#

Module Contents#

Classes#

BasevLLMParameter

Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader; the loaded weight is copied into the parameter when the provided weight loader is called.

BlockQuantScaleParameter

Parameter class for weight scales loaded for weights with block-wise quantization. Uses both column and row parallelism.

ChannelQuantScaleParameter

Parameter class for weight scales loaded for weights with channel-wise quantization. Equivalent to _ColumnvLLMParameter.

GroupQuantScaleParameter

Parameter class for weight scales loaded for weights with grouped quantization. Uses both column and row parallelism.

ModelWeightParameter

Parameter class for linear layer weights. Uses both column and row parallelism.

PackedColumnParameter

Parameter for model parameters which are packed on disk and support column parallelism only. See PackedvLLMParameter for more details on the packed properties.

PackedvLLMParameter

Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends ModelWeightParameter to take in the packed factor, the packed dimension and, optionally, the marlin tile size for marlin kernels. Adjusts the shard_size and shard_offset during model weight loading for fused linear layers, accounting for packing and, optionally, the marlin tile size.

PerTensorScaleParameter

Parameter class for scales where the number of scales equals the number of logical matrices in fused linear layers (e.g. for QKV, there are 3 scales loaded from disk). This is relevant to weights with per-tensor quantization. Adds functionality to map the scales to a shard during weight loading.

RowvLLMParameter

Parameter class defining weight-loading functionality (load_row_parallel_weight) for parameters loaded into linear layers with row-parallel functionality. Requires an input_dim to be defined.

Functions#

permute_param_layout_

Permute a parameter’s layout to the specified input and output dimensions. This is useful for forcing the parameter into a known layout: for example, if a packed (quantized) weight matrix must be in the layout {input_dim = 0, output_dim = 1, packed_dim = 0}, calling permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0) ensures x is in that layout, permuting it if required and asserting if it cannot be brought into the requested layout.

Data#

API#

class fastvideo.v1.models.parameter.BasevLLMParameter(data: torch.Tensor, weight_loader: Callable)[source]#

Bases: torch.nn.Parameter

Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader; the loaded weight is copied into the parameter when the provided weight loader is called.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

load_column_parallel_weight(loaded_weight: torch.Tensor) None[source]#
load_merged_column_weight(loaded_weight: torch.Tensor, **kwargs) None[source]#
load_qkv_weight(loaded_weight: torch.Tensor, **kwargs) None[source]#
load_row_parallel_weight(loaded_weight: torch.Tensor) None[source]#
property weight_loader[source]#
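
A minimal sketch of how a BasevLLMParameter is typically created and later filled by its weight loader; the layer, the shapes, and the copy-based loader below are illustrative and not part of this module.

import torch

from fastvideo.v1.models.parameter import BasevLLMParameter


def copy_weight_loader(param: torch.nn.Parameter,
                       loaded_weight: torch.Tensor) -> None:
    # Illustrative loader: copy the checkpoint tensor into the parameter.
    assert param.data.shape == loaded_weight.shape
    param.data.copy_(loaded_weight)


layer = torch.nn.Module()
weight = BasevLLMParameter(data=torch.empty(1024, 4096),
                           weight_loader=copy_weight_loader)
layer.register_parameter("weight", weight)

# During checkpoint loading, the stored loader is looked up and invoked,
# which copies the loaded tensor into the registered parameter.
weight.weight_loader(weight, torch.randn(1024, 4096))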
class fastvideo.v1.models.parameter.BlockQuantScaleParameter(output_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter, fastvideo.v1.models.parameter.RowvLLMParameter

Parameter class for weight scales loaded for weights with block-wise quantization. Uses both column and row parallelism.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

class fastvideo.v1.models.parameter.ChannelQuantScaleParameter(output_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter

Parameter class for weight scales loaded for weights with channel-wise quantization. Equivalent to _ColumnvLLMParameter.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

class fastvideo.v1.models.parameter.GroupQuantScaleParameter(output_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter, fastvideo.v1.models.parameter.RowvLLMParameter

Parameter class for weight scales loaded for weights with grouped quantization. Uses both column and row parallelism.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter
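
An illustrative sketch of how the scale parameter classes above line up with a quantized linear weight; the shapes, group size, and copy-based loader are assumptions rather than values defined by this module. BlockQuantScaleParameter follows the same pattern with one scale per weight block.

import torch

from fastvideo.v1.models.parameter import (ChannelQuantScaleParameter,
                                            GroupQuantScaleParameter)


def copy_weight_loader(param, loaded_weight):
    param.data.copy_(loaded_weight)


out_features, in_features, group_size = 4096, 11008, 128

# Channel-wise quantization: one scale per output channel, so only an
# output_dim is needed (column parallelism only).
channel_scale = ChannelQuantScaleParameter(
    data=torch.empty(out_features, 1),
    output_dim=0,
    weight_loader=copy_weight_loader)

# Group-wise quantization: one scale per (output channel, input group), so
# the scale is sharded along both dimensions (column and row parallelism).
group_scale = GroupQuantScaleParameter(
    data=torch.empty(out_features, in_features // group_size),
    output_dim=0,
    input_dim=1,
    weight_loader=copy_weight_loader)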

class fastvideo.v1.models.parameter.ModelWeightParameter(output_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter, fastvideo.v1.models.parameter.RowvLLMParameter

Parameter class for linear layer weights. Uses both column and row parallelism.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter
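
A sketch of the common construction pattern for an unquantized linear weight: output_dim and input_dim tell the loading helpers which dimension is sharded in column-parallel and row-parallel layers respectively. The shapes and loader are illustrative.

import torch

from fastvideo.v1.models.parameter import ModelWeightParameter


def copy_weight_loader(param, loaded_weight):
    param.data.copy_(loaded_weight)


weight = ModelWeightParameter(
    data=torch.empty(4096, 11008),  # [out_features, in_features]
    output_dim=0,  # dimension sharded by column-parallel layers
    input_dim=1,   # dimension sharded by row-parallel layers
    weight_loader=copy_weight_loader)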

class fastvideo.v1.models.parameter.PackedColumnParameter(packed_factor: Union[int, fractions.Fraction], packed_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter._ColumnvLLMParameter

Parameter for model parameters which are packed on disk and support column parallelism only. See PackedvLLMParameter for more details on the packed properties.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

adjust_shard_indexes_for_packing(shard_size, shard_offset) Tuple[Any, Any][source]#
property packed_dim[source]#
property packed_factor[source]#
class fastvideo.v1.models.parameter.PackedvLLMParameter(packed_factor: Union[int, fractions.Fraction], packed_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter.ModelWeightParameter

Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends ModelWeightParameter to take in the packed factor, the packed dimension and, optionally, the marlin tile size for marlin kernels. Adjusts the shard_size and shard_offset during model weight loading for fused linear layers, accounting for packing and, optionally, the marlin tile size.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

adjust_shard_indexes_for_packing(shard_size, shard_offset)[source]#
property packed_dim[source]#
property packed_factor[source]#
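
An illustrative sketch of the shard bookkeeping described above, assuming GPTQ-style int4 weights packed eight-per-int32 along the input dimension; the shapes and the loader are not defined by this module.

import torch

from fastvideo.v1.models.parameter import PackedvLLMParameter


def copy_weight_loader(param, loaded_weight):
    param.data.copy_(loaded_weight)


# 4096 logical int4 rows packed 8-per-int32 -> 512 packed rows.
packed_weight = PackedvLLMParameter(
    data=torch.empty(512, 4096, dtype=torch.int32),
    input_dim=0,
    output_dim=1,
    packed_dim=0,
    packed_factor=8,
    weight_loader=copy_weight_loader)

# Shard sizes/offsets given in logical elements are rescaled to index the
# packed tensor: 2048 logical elements at offset 2048 become 256 at 256
# (no marlin tile size in this sketch).
shard_size, shard_offset = packed_weight.adjust_shard_indexes_for_packing(
    shard_size=2048, shard_offset=2048)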
class fastvideo.v1.models.parameter.PerTensorScaleParameter(**kwargs)[source]#

Bases: fastvideo.v1.models.parameter.BasevLLMParameter

Parameter class for scales where the number of scales equals the number of logical matrices in fused linear layers (e.g. for QKV, there are 3 scales loaded from disk). This is relevant to weights with per-tensor quantization. Adds functionality to map the scales to a shard during weight loading.

Note: additional parameter manipulation may be handled per quantization config within process_weights_after_loading.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

load_column_parallel_weight(*args, **kwargs) None[source]#
load_merged_column_weight(*args, **kwargs) None[source]#
load_qkv_weight(*args, **kwargs) None[source]#
load_row_parallel_weight(*args, **kwargs) None[source]#
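
A sketch of the per-tensor scale pattern for a fused QKV projection; the "q"/"k"/"v" shard-id convention and the copy-based loader below are assumptions about the surrounding weight-loading code, not guarantees of this module.

import torch

from fastvideo.v1.models.parameter import PerTensorScaleParameter


def copy_weight_loader(param, loaded_weight):
    param.data.copy_(loaded_weight)


# One scale per logical matrix of the fused layer: Q, K and V.
qkv_scales = PerTensorScaleParameter(data=torch.empty(3),
                                     weight_loader=copy_weight_loader)

# During weight loading each scale read from disk is mapped to its shard,
# e.g. the Q-projection scale (shard id convention assumed here):
qkv_scales.load_qkv_weight(loaded_weight=torch.tensor(0.02), shard_id="q")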
class fastvideo.v1.models.parameter.RowvLLMParameter(input_dim: int, **kwargs)[source]#

Bases: fastvideo.v1.models.parameter.BasevLLMParameter

Parameter class defining weight-loading functionality (load_row_parallel_weight) for parameters loaded into linear layers with row-parallel functionality. Requires an input_dim to be defined.

Initialization

Initialize the BasevLLMParameter

Parameters:
  • data – torch tensor with the parameter data

  • weight_loader – weight loader callable

Returns:

a torch.nn.Parameter

property input_dim[source]#
load_row_parallel_weight(loaded_weight: torch.Tensor) None[source]#
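
A conceptual sketch of the narrowing that load_row_parallel_weight performs: the full checkpoint tensor is sliced along input_dim according to the tensor-parallel rank. Plain torch is used here so the snippet runs without initializing a distributed group; the parallel configuration is illustrative.

import torch

full_weight = torch.randn(4096, 11008)  # [out_features, in_features] on disk
input_dim, tp_size, tp_rank = 1, 4, 1   # illustrative tensor-parallel setup

shard_size = full_weight.shape[input_dim] // tp_size
local_shard = full_weight.narrow(input_dim, tp_rank * shard_size, shard_size)
print(local_shard.shape)  # torch.Size([4096, 2752])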
fastvideo.v1.models.parameter.logger[source]#

'init_logger(…)'

fastvideo.v1.models.parameter.permute_param_layout_(param: fastvideo.v1.models.parameter.BasevLLMParameter, input_dim: int, output_dim: int, **kwargs) fastvideo.v1.models.parameter.BasevLLMParameter[source]#

Permute a parameter’s layout to the specified input and output dimensions. This is useful for forcing the parameter into a known layout: for example, if a packed (quantized) weight matrix must be in the layout {input_dim = 0, output_dim = 1, packed_dim = 0}, calling permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0) ensures x is in that layout, permuting it if required and asserting if it cannot be brought into the requested layout.
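
A minimal usage sketch following the example in the docstring; the packed parameter constructed around the call is illustrative.

import torch

from fastvideo.v1.models.parameter import (PackedvLLMParameter,
                                            permute_param_layout_)


def copy_weight_loader(param, loaded_weight):
    param.data.copy_(loaded_weight)


x = PackedvLLMParameter(
    data=torch.empty(512, 4096, dtype=torch.int32),
    input_dim=0,
    output_dim=1,
    packed_dim=0,
    packed_factor=8,
    weight_loader=copy_weight_loader)

# Force (or verify) the layout expected by a downstream kernel; the call
# permutes the parameter if needed and asserts if it cannot reach the
# requested layout.
x = permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0)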