layers
¶
Modules¶
fastvideo.layers.activation
¶
Custom activation functions.
Classes¶
fastvideo.layers.activation.GeluAndMul
¶
GeluAndMul(approximate: str = 'none')
Bases: CustomOp
An activation function for GeGLU.
The function computes x -> GELU(x[:d]) * x[d:] where d = x.shape[-1] // 2.
Shapes
x: (batch_size, seq_len, 2 * d) or (num_tokens, 2 * d)
return: (batch_size, seq_len, d) or (num_tokens, d)
Source code in fastvideo/layers/activation.py
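A minimal sketch of the computation described above, written with plain PyTorch ops rather than the CustomOp dispatch; the tensor sizes are illustrative:

```python
import torch
import torch.nn.functional as F

# x packs the value and gate halves along the last dimension: (num_tokens, 2 * d).
x = torch.randn(4, 2 * 128)
d = x.shape[-1] // 2

# GeGLU gating: GELU(x[..., :d]) * x[..., d:], matching approximate='none'.
out = F.gelu(x[..., :d], approximate="none") * x[..., d:]
print(out.shape)  # torch.Size([4, 128])
```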
fastvideo.layers.activation.NewGELU
¶
fastvideo.layers.activation.QuickGELU
¶
fastvideo.layers.activation.SiluAndMul
¶
Bases: CustomOp
An activation function for SwiGLU.
The function computes x -> silu(x[:d]) * x[d:] where d = x.shape[-1] // 2.
Shapes
x: (num_tokens, 2 * d) or (batch_size, seq_len, 2 * d)
return: (num_tokens, d) or (batch_size, seq_len, d)
Functions¶
fastvideo.layers.activation.get_act_and_mul_fn
¶
get_act_and_mul_fn(act_fn_name: str) -> Module
Get an activation-and-mul (e.g. SiluAndMul) function by name.
Source code in fastvideo/layers/activation.py
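A hedged usage sketch; it assumes "silu" is one of the registered names (the registry lives in fastvideo/layers/activation.py) and that the returned module behaves like SiluAndMul:

```python
import torch
from fastvideo.layers.activation import get_act_and_mul_fn

# Assumption: "silu" is a registered key mapping to SiluAndMul.
act = get_act_and_mul_fn("silu")

x = torch.randn(2, 16, 2 * 64)   # (batch_size, seq_len, 2 * d)
y = act(x)                       # (batch_size, seq_len, d)
print(y.shape)                   # torch.Size([2, 16, 64])
```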
fastvideo.layers.custom_op
¶
Classes¶
fastvideo.layers.custom_op.CustomOp
¶
Bases: Module
Base class for custom ops. Dispatches the forward method to the appropriate backend.
Source code in fastvideo/layers/custom_op.py
Functions¶
fastvideo.layers.custom_op.CustomOp.default_on
staticmethod
¶default_on() -> bool
On by default if level < CompilationLevel.PIECEWISE. Specifying 'all' or 'none' in custom_op takes precedence.
fastvideo.layers.custom_op.CustomOp.forward_native
¶forward_native(*args, **kwargs) -> Any
PyTorch-native implementation of the forward method. This method is optional; if implemented, it can be used with compilers such as torch.compile or PyTorch XLA, and it is also useful for testing.
Source code in fastvideo/layers/custom_op.py
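A rough sketch of how a CustomOp subclass is typically laid out, under the assumption that forward() dispatches to backend-specific methods and falls back to forward_native; any backend hook beyond forward_native is illustrative:

```python
import torch
from fastvideo.layers.custom_op import CustomOp

class ScaledTanh(CustomOp):
    """Toy op whose forward() is dispatched by CustomOp to a backend implementation."""

    def __init__(self, scale: float = 2.0) -> None:
        super().__init__()
        self.scale = scale

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Pure-PyTorch reference path; usable with torch.compile and in tests.
        return torch.tanh(x) * self.scale
```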
Functions¶
fastvideo.layers.layernorm
¶
Custom normalization layers.
Classes¶
fastvideo.layers.layernorm.LayerNormScaleShift
¶
LayerNormScaleShift(hidden_size: int, norm_type: str = 'rms', eps: float = 1e-06, elementwise_affine: bool = False, dtype: dtype = float32, compute_dtype: dtype | None = None, prefix: str = '')
Bases: Module
Fused operation that combines LayerNorm with scale and shift operations. This reduces memory bandwidth by combining memory-bound operations.
Source code in fastvideo/layers/layernorm.py
Functions¶
fastvideo.layers.layernorm.LayerNormScaleShift.forward
¶Apply layer normalization followed by scale and shift in a single fused operation.
Source code in fastvideo/layers/layernorm.py
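An unfused reference of what the fused forward computes, assuming the common DiT-style modulation normed * (1 + scale) + shift; the exact convention used by the fused kernel is defined in fastvideo/layers/layernorm.py:

```python
import torch
import torch.nn.functional as F

def layernorm_scale_shift_reference(x, shift, scale, eps=1e-6):
    # Unfused reference: normalize, then apply scale and shift in separate passes.
    # The fused layer performs the same memory-bound steps in a single pass.
    normed = F.layer_norm(x, x.shape[-1:], eps=eps)
    return normed * (1.0 + scale) + shift  # (1 + scale) is an assumed convention

x = torch.randn(2, 16, 512)
shift = torch.randn(2, 1, 512)
scale = torch.randn(2, 1, 512)
out = layernorm_scale_shift_reference(x, shift, scale)
```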
fastvideo.layers.layernorm.RMSNorm
¶
RMSNorm(hidden_size: int, eps: float = 1e-06, dtype: dtype = float32, var_hidden_size: int | None = None, has_weight: bool = True)
Bases: CustomOp
Root mean square normalization.
Computes x -> w * x / sqrt(E[x^2] + eps) where w is the learned weight. Refer to https://arxiv.org/abs/1910.07467
Source code in fastvideo/layers/layernorm.py
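A short native sketch of the formula above, w * x / sqrt(E[x^2] + eps); the actual layer additionally supports an optional fused residual input and has_weight=False:

```python
import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # E[x^2] over the last dimension, per https://arxiv.org/abs/1910.07467
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
    return weight * x * torch.rsqrt(mean_sq + eps)

x = torch.randn(2, 16, 512)
w = torch.ones(512)
y = rms_norm_reference(x, w)
```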
Functions¶
fastvideo.layers.layernorm.RMSNorm.forward_native
¶forward_native(x: Tensor, residual: Tensor | None = None) -> Tensor | tuple[Tensor, Tensor]
PyTorch-native implementation equivalent to forward().
Source code in fastvideo/layers/layernorm.py
fastvideo.layers.layernorm.ScaleResidual
¶
ScaleResidual(prefix: str = '')
Bases: Module
Applies gated residual connection.
Source code in fastvideo/layers/layernorm.py
Functions¶
fastvideo.layers.layernorm.ScaleResidual.forward
¶Apply gated residual connection.
Source code in fastvideo/layers/layernorm.py
fastvideo.layers.layernorm.ScaleResidualLayerNormScaleShift
¶
ScaleResidualLayerNormScaleShift(hidden_size: int, norm_type: str = 'rms', eps: float = 1e-06, elementwise_affine: bool = False, dtype: dtype = float32, compute_dtype: dtype | None = None, prefix: str = '')
Bases: Module
Fused operation that combines:
1. Gated residual connection
2. LayerNorm
3. Scale and shift operations
This reduces memory bandwidth by combining memory-bound operations.
Source code in fastvideo/layers/layernorm.py
Functions¶
fastvideo.layers.layernorm.ScaleResidualLayerNormScaleShift.forward
¶forward(residual: Tensor, x: Tensor, gate: Tensor | int, shift: Tensor, scale: Tensor) -> tuple[Tensor, Tensor]
Apply gated residual connection, followed by layernorm and scale/shift in a single fused operation.
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | Tuple containing the normalized and modulated output tensor and the updated residual tensor. |
Source code in fastvideo/layers/layernorm.py
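For orientation, an unfused sketch of the three steps and the two returned tensors; the gating and (1 + scale) conventions are assumptions made to illustrate the data flow, not a copy of the fused kernel:

```python
import torch
import torch.nn.functional as F

def scale_residual_norm_scale_shift(residual, x, gate, shift, scale, eps=1e-6):
    # 1. Gated residual connection.
    new_residual = residual + gate * x
    # 2. LayerNorm (an RMS variant is used when norm_type='rms').
    normed = F.layer_norm(new_residual, new_residual.shape[-1:], eps=eps)
    # 3. Scale and shift (assumed DiT-style modulation).
    modulated = normed * (1.0 + scale) + shift
    return modulated, new_residual
```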
fastvideo.layers.linear
¶
Classes¶
fastvideo.layers.linear.ColumnParallelLinear
¶
ColumnParallelLinear(input_size: int, output_size: int, bias: bool = True, gather_output: bool = False, skip_bias_add: bool = False, params_dtype: dtype | None = None, quant_config: QuantizationConfig | None = None, output_sizes: list[int] | None = None, prefix: str = '')
Bases: LinearBase
Linear layer with column parallelism.
The linear layer is defined as Y = XA + b. A is parallelized along its second dimension as A = [A_1, ..., A_p].
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | First dimension of matrix A. | required |
| output_size | int | Second dimension of matrix A. | required |
| bias | bool | If true, add bias. | True |
| gather_output | bool | If true, call all-gather on the output and make Y available to all GPUs; otherwise, every GPU has its own output Y_i = XA_i. | False |
| skip_bias_add | bool | Enables performance optimizations where bias can be fused with other element-wise operations: bias is not added but returned instead. | False |
| params_dtype | dtype \| None | Data type for the parameters. | None |
| quant_config | QuantizationConfig \| None | Quantization configuration. | None |
| output_sizes | list[int] \| None | List of output sizes packed into one output; e.g., for QKV the list would have size 3. | None |
| prefix | str | The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj). | '' |
Source code in fastvideo/layers/linear.py
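A toy, single-process sketch of the column split described above (Y = XA + b with A = [A_1, ..., A_p] along its second dimension), using plain tensors instead of the distributed layer:

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)        # (batch, input_size)
A = torch.randn(8, 6)        # full weight: input_size x output_size

# Split A column-wise across two "ranks"; each rank computes Y_i = X A_i.
A_1, A_2 = A.chunk(2, dim=1)
Y_1, Y_2 = X @ A_1, X @ A_2

# gather_output=True corresponds to concatenating the shards back together.
Y = torch.cat([Y_1, Y_2], dim=1)
assert torch.allclose(Y, X @ A, atol=1e-6)
```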
fastvideo.layers.linear.LinearBase
¶
LinearBase(input_size: int, output_size: int, skip_bias_add: bool = False, params_dtype: dtype | None = None, quant_config: QuantizationConfig | None = None, prefix: str = '')
Bases: Module
Base linear layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | Input dimension of the linear layer. | required |
| output_size | int | Output dimension of the linear layer. | required |
| bias | | If true, add bias. | required |
| skip_bias_add | bool | If true, skip adding bias but instead return it. | False |
| params_dtype | dtype \| None | Data type for the parameters. | None |
| quant_config | QuantizationConfig \| None | Quantization configuration. | None |
Source code in fastvideo/layers/linear.py
fastvideo.layers.linear.LinearMethodBase
¶
Bases: QuantizeMethodBase
Base class for different (maybe quantized) linear methods.
Functions¶
fastvideo.layers.linear.LinearMethodBase.apply
abstractmethod
¶Apply the weights in layer to the input tensor. Expects create_weights to have been called before on the layer.
Source code in fastvideo/layers/linear.py
fastvideo.layers.linear.LinearMethodBase.create_weights
abstractmethod
¶create_weights(layer: Module, input_size_per_partition: int, output_partition_sizes: list[int], input_size: int, output_size: int, params_dtype: dtype, **extra_weight_attrs) -> None
Create weights for a linear layer. The weights will be set as attributes of the layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| layer | Module | The layer that is using the LinearMethodBase factory. | required |
| input_size_per_partition | int | Size of the weight input dim on rank X. | required |
| output_partition_sizes | list[int] | Sizes of the output dim of each logical weight on rank X. E.g., for QKVLinear this is a list containing the widths of Wq, Wk, and Wv on rank X. | required |
| input_size | int | Size of the input dim of the weight across all ranks. | required |
| output_size | int | Size of the output dim of the weight across all ranks. | required |
| params_dtype | dtype | Datatype of the parameters. | required |
Source code in fastvideo/layers/linear.py
fastvideo.layers.linear.MergedColumnParallelLinear
¶
MergedColumnParallelLinear(input_size: int, output_sizes: list[int], bias: bool = True, gather_output: bool = False, skip_bias_add: bool = False, params_dtype: dtype | None = None, quant_config: QuantizationConfig | None = None, prefix: str = '')
Bases: ColumnParallelLinear
Packed linear layers with column parallelism.
Similar to ColumnParallelLinear, but the weight matrix is concatenated along the output dimension. When the weight matrix is loaded, the different partitions are sharded separately.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | Input dimension of the linear layer. | required |
| output_sizes | list[int] | List of output dimensions of the linear layer. | required |
| bias | bool | If true, add bias. | True |
| gather_output | bool | If true, call all-gather on the output and make it available to all GPUs; otherwise, every GPU has its own output. | False |
| skip_bias_add | bool | Enables performance optimizations where bias can be fused with other element-wise operations: bias is not added but returned instead. | False |
| params_dtype | dtype \| None | Data type for the parameters. | None |
| quant_config | QuantizationConfig \| None | Quantization configuration. | None |
| prefix | str | The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj). | '' |
Source code in fastvideo/layers/linear.py
fastvideo.layers.linear.QKVParallelLinear
¶
QKVParallelLinear(hidden_size: int, head_size: int, total_num_heads: int, total_num_kv_heads: int | None = None, bias: bool = True, skip_bias_add: bool = False, params_dtype: dtype | None = None, quant_config: QuantizationConfig | None = None, prefix: str = '')
Bases: ColumnParallelLinear
Linear layers for the attention's QKV transformation.
Linear layers for the linear transformation of the query, key, and value vectors in the attention layer. The weight matrix is concatenated along the output dimension. The layer is parallelized along the head dimension. When the number of key/value heads is smaller than the number of query heads (e.g., multi-query/grouped-query attention), the key/value head may be replicated while the query heads are partitioned.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hidden_size | int | Input hidden state size of the transformer. | required |
| head_size | int | Size of each attention head. | required |
| total_num_heads | int | Total number of attention query heads. | required |
| total_num_kv_heads | int \| None | Total number of attention key/value heads. If None, assume total_num_kv_heads = total_num_heads. | None |
| bias | bool | If true, add bias. | True |
| skip_bias_add | bool | Enables performance optimizations where bias can be fused with other element-wise operations: bias is not added but returned instead. | False |
| params_dtype | dtype \| None | Data type for the parameters. | None |
| quant_config | QuantizationConfig \| None | Quantization configuration. | None |
| prefix | str | The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj). | '' |
Source code in fastvideo/layers/linear.py
fastvideo.layers.linear.ReplicatedLinear
¶
ReplicatedLinear(input_size: int, output_size: int, bias: bool = True, skip_bias_add: bool = False, params_dtype: dtype | None = None, quant_config: QuantizationConfig | None = None, prefix: str = '')
Bases: LinearBase
Replicated linear layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | Input dimension of the linear layer. | required |
| output_size | int | Output dimension of the linear layer. | required |
| bias | bool | If true, add bias. | True |
| skip_bias_add | bool | If true, skip adding bias but instead return it. | False |
| params_dtype | dtype \| None | Data type for the parameters. | None |
| quant_config | QuantizationConfig \| None | Quantization configuration. | None |
| prefix | str | The name of the layer in the state dict, including all parents (e.g. model.layers.0.qkv_proj). | '' |
Source code in fastvideo/layers/linear.py
fastvideo.layers.linear.RowParallelLinear
¶
RowParallelLinear(input_size: int, output_size: int, bias: bool = True, input_is_parallel: bool = True, skip_bias_add: bool = False, params_dtype: dtype | None = None, reduce_results: bool = True, quant_config: QuantizationConfig | None = None, prefix: str = '')
Bases: LinearBase
Linear layer with row parallelism.
The linear layer is defined as Y = XA + b. A is parallelized along its first dimension and X along its second dimension, i.e. A is split row-wise into blocks [A_1; ...; A_p] and X is split column-wise into [X_1, ..., X_p], so the full output is the sum of the per-rank partial products X_i A_i.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | First dimension of matrix A. | required |
| output_size | int | Second dimension of matrix A. | required |
| bias | bool | If true, add bias. Note that bias is not parallelized. | True |
| input_is_parallel | bool | If true, we assume the input is already split across the GPUs and do not split it again. | True |
| skip_bias_add | bool | Enables performance optimization where bias can be fused with other element-wise operations: bias is not added but returned instead. | False |
| params_dtype | dtype \| None | Data type for the parameters. | None |
| quant_config | QuantizationConfig \| None | Quantization configuration. | None |
Source code in fastvideo/layers/linear.py
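Continuing the toy example, a sketch of the row-parallel case: A is split along its first dimension, X along its second, and the partial products are summed; in the distributed layer that summation is the all-reduce enabled by reduce_results:

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)
A = torch.randn(8, 6)

# Split A row-wise and X column-wise across two "ranks".
A_1, A_2 = A.chunk(2, dim=0)
X_1, X_2 = X.chunk(2, dim=1)

# Each rank holds (X_i, A_i); the full result is the sum of the partial products.
Y = X_1 @ A_1 + X_2 @ A_2
assert torch.allclose(Y, X @ A, atol=1e-5)
```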
fastvideo.layers.linear.UnquantizedLinearMethod
¶
Functions¶
fastvideo.layers.linear.adjust_scalar_to_fused_array
¶
adjust_scalar_to_fused_array(param: Tensor, loaded_weight: Tensor, shard_id: str | int) -> tuple[Tensor, Tensor]
For fused modules (QKV and MLP) we have an array of length N that holds 1 scale for each "logical" matrix. So the param is an array of length N. The loaded_weight corresponds to one of the shards on disk. Here, we slice the param based on the shard_id for loading.
Source code in fastvideo/layers/linear.py
fastvideo.layers.mlp
¶
Classes¶
fastvideo.layers.mlp.MLP
¶
MLP(input_dim: int, mlp_hidden_dim: int, output_dim: int | None = None, bias: bool = True, act_type: str = 'gelu_pytorch_tanh', dtype: dtype | None = None, prefix: str = '')
Bases: Module
MLP for DiT blocks (no gated linear units).
Source code in fastvideo/layers/mlp.py
Functions¶
fastvideo.layers.quantization
¶
Classes¶
Functions¶
fastvideo.layers.quantization.register_quantization_config
¶
register_quantization_config(quantization: str)
Register a customized vllm quantization config.
When a quantization method is not supported by vllm, you can register a customized quantization config to support it.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| quantization | str | The quantization method name. | required |
Examples:
>>> from fastvideo.layers.quantization import register_quantization_config
>>> from fastvideo.layers.quantization import get_quantization_config
>>> from fastvideo.layers.quantization.base_config import QuantizationConfig
>>>
>>> @register_quantization_config("my_quant")
... class MyQuantConfig(QuantizationConfig):
... pass
>>>
>>> get_quantization_config("my_quant")
<class 'MyQuantConfig'>
Source code in fastvideo/layers/quantization/__init__.py
Modules¶
fastvideo.layers.quantization.base_config
¶
Classes¶
fastvideo.layers.quantization.base_config.QuantizationConfig
¶
Bases: ABC
Base class for quantization configs.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizationConfig.from_config
abstractmethod
classmethod
¶from_config(config: dict[str, Any]) -> QuantizationConfig
Create a config class from the model's quantization config.
fastvideo.layers.quantization.base_config.QuantizationConfig.get_config_filenames
abstractmethod
staticmethod
¶ fastvideo.layers.quantization.base_config.QuantizationConfig.get_from_keys
staticmethod
¶Get a value from the model's quantization config.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizationConfig.get_from_keys_or
staticmethod
¶Get an optional value from the model's quantization config.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizationConfig.get_min_capability
abstractmethod
classmethod
¶get_min_capability() -> int
Minimum GPU capability to support the quantization method.
E.g., 70 for Volta, 75 for Turing, 80 for Ampere. This requirement is due to the custom CUDA kernels used by the quantization method.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizationConfig.get_name
abstractmethod
¶ fastvideo.layers.quantization.base_config.QuantizationConfig.get_quant_method
abstractmethod
¶get_quant_method(layer: Module, prefix: str) -> QuantizeMethodBase | None
Get the quantize method to use for the quantized layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| layer | Module | The layer for the quant method. | required |
| prefix | str | The full name of the layer in the state dict. | required |
Returns: The quantize method, or None if the given layer doesn't support a quant method.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizationConfig.get_supported_act_dtypes
abstractmethod
¶get_supported_act_dtypes() -> list[dtype]
fastvideo.layers.quantization.base_config.QuantizationConfig.override_quantization_method
classmethod
¶Detects whether this quantization method can support a given checkpoint format by overriding the user-specified quantization method. This method should only be overridden by subclasses in exceptional circumstances.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizeMethodBase
¶
Bases: ABC
Base class for different quantized methods.
fastvideo.layers.quantization.base_config.QuantizeMethodBase.apply
abstractmethod
¶Apply the weights in layer to the input tensor.
Expects create_weights to have been called before on the layer.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizeMethodBase.create_weights
abstractmethod
¶Create weights for a layer.
The weights will be set as attributes of the layer.
fastvideo.layers.quantization.base_config.QuantizeMethodBase.embedding
¶Gather embeddings in the layer based on indices in the input tensor.
Expects create_weights to have been called before on the layer.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.quantization.base_config.QuantizeMethodBase.process_weights_after_loading
¶Process the weight after loading.
This can be used for example, to transpose weights for computation.
Functions¶
fastvideo.layers.quantization.base_config.method_has_implemented_embedding
¶method_has_implemented_embedding(method_class: type[QuantizeMethodBase]) -> bool
Not all quant methods have embedding implemented, so we need to check that it exists for our given method. We check this by making sure the function has been changed from the base implementation.
Source code in fastvideo/layers/quantization/base_config.py
fastvideo.layers.rotary_embedding
¶
Rotary Positional Embeddings.
Classes¶
fastvideo.layers.rotary_embedding.RotaryEmbedding
¶
RotaryEmbedding(head_size: int, rotary_dim: int, max_position_embeddings: int, base: int | float, is_neox_style: bool, dtype: dtype)
Bases: CustomOp
Original rotary positional embedding.
Source code in fastvideo/layers/rotary_embedding.py
Functions¶
fastvideo.layers.rotary_embedding.RotaryEmbedding.forward_native
¶forward_native(positions: Tensor, query: Tensor, key: Tensor, offsets: Tensor | None = None) -> tuple[Tensor, Tensor]
A PyTorch-native implementation of forward().
Source code in fastvideo/layers/rotary_embedding.py
Functions¶
fastvideo.layers.rotary_embedding.apply_rotary_emb
¶
apply_rotary_emb(x: Tensor, freqs_cis: Tensor | tuple[Tensor, Tensor], use_real: bool = True, use_real_unbind_dim: int = -1) -> Tensor
Apply rotary embeddings to the input tensor using the given frequency tensor. The query or key tensor 'x' is reshaped as complex numbers, the frequency tensor 'freqs_cis' is reshaped for broadcasting compatibility, and the result, with rotary embeddings applied, is returned as a real tensor.
Args:
x (torch.Tensor): Query or key tensor to apply rotary embeddings to. [B, H, S, D]
freqs_cis (Tensor | tuple[torch.Tensor, torch.Tensor]): Precomputed frequency tensor for complex exponentials. ([S, D], [S, D])
Returns:
torch.Tensor: The query or key tensor with rotary embeddings applied.
Source code in fastvideo/layers/rotary_embedding.py
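A reduced sketch of the real-valued rotation (use_real=True), assuming the half-split pairing of the head dimension; whether pairs are half-split or interleaved is what use_real_unbind_dim selects in the actual function:

```python
import torch

def rotate_half_reference(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Treat the last dim as pairs (x1, x2) and rotate each pair:
    # (x1, x2) -> (x1*cos - x2*sin, x2*cos + x1*sin)
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * torch.cat((cos, cos), dim=-1) + rotated * torch.cat((sin, sin), dim=-1)

# x: [B, H, S, D]; cos/sin: [S, D/2], broadcast over batch and heads.
x = torch.randn(1, 8, 16, 64)
cos, sin = torch.randn(16, 32), torch.randn(16, 32)
out = rotate_half_reference(x, cos, sin)
```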
fastvideo.layers.rotary_embedding.get_1d_rotary_pos_embed
¶
get_1d_rotary_pos_embed(dim: int, pos: FloatTensor | int, theta: float = 10000.0, theta_rescale_factor: float = 1.0, interpolation_factor: float = 1.0, dtype: dtype = float32) -> tuple[Tensor, Tensor]
Precompute the frequency tensor for complex exponentials (cis) with the given dimensions.
(Note: cis means cos + i * sin, where i is the imaginary unit.)
This function calculates a frequency tensor with complex exponentials using the given dimension 'dim' and the position indices 'pos'. The 'theta' parameter scales the frequencies.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Dimension of the frequency tensor. | required |
| pos | int or FloatTensor | Position indices for the frequency tensor. [S] or scalar. | required |
| theta | float | Scaling factor for frequency computation. Defaults to 10000.0. | 10000.0 |
| theta_rescale_factor | float | Rescale factor for theta. Defaults to 1.0. | 1.0 |
| interpolation_factor | float | Factor to scale positions. Defaults to 1.0. | 1.0 |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | freqs_cos, freqs_sin: precomputed frequency tensors with real and imaginary parts separated. [S, D] |
Source code in fastvideo/layers/rotary_embedding.py
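A compact sketch of the frequency computation the docstring describes, cos/sin of the outer product of positions with 1/theta^(2i/dim); rescaling, interpolation, and the exact duplication layout of the halves follow the actual source:

```python
import torch

def rotary_cos_sin_reference(dim: int, pos: torch.Tensor, theta: float = 10000.0):
    # freqs[i] = 1 / theta^(2i / dim), for i in [0, dim/2)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(pos.float(), freqs)          # [S, dim/2]
    # Duplicate so cos/sin have shape [S, dim]; whether the duplication is
    # repeated or interleaved is an assumption about the real implementation.
    return angles.cos().repeat(1, 2), angles.sin().repeat(1, 2)

cos, sin = rotary_cos_sin_reference(64, torch.arange(128))
print(cos.shape, sin.shape)  # torch.Size([128, 64]) torch.Size([128, 64])
```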
fastvideo.layers.rotary_embedding.get_meshgrid_nd
¶
Get n-D meshgrid with start, stop and num.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start | int or tuple | If len(args) == 0, start is num; if len(args) == 1, start is start, args[0] is stop, and the step is 1; if len(args) == 2, start is start, args[0] is stop, and args[1] is num. For n-dim, start/stop/num should be an int or an n-tuple. If an n-tuple is provided, the meshgrid is stacked following the dim order in the n-tuples. | required |
| *args | int \| tuple[int, ...] | See above. | () |
| dim | int | Dimension of the meshgrid. Defaults to 2. | 2 |

Returns:

| Name | Type | Description |
|---|---|---|
| grid | ndarray | [dim, ...] |
Source code in fastvideo/layers/rotary_embedding.py
fastvideo.layers.rotary_embedding.get_nd_rotary_pos_embed
¶
get_nd_rotary_pos_embed(rope_dim_list, start, *args, theta=10000.0, theta_rescale_factor: float | list[float] = 1.0, interpolation_factor: float | list[float] = 1.0, shard_dim: int = 0, sp_rank: int = 0, sp_world_size: int = 1, dtype: dtype = float32, start_frame: int = 0) -> tuple[Tensor, Tensor]
This is an n-d version of precompute_freqs_cis, i.e. a RoPE for tokens with n-d structure. Supports sequence parallelism by allowing a specific dimension to be sharded.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rope_dim_list | list of int | Dimension of each rope. len(rope_dim_list) should equal n, and sum(rope_dim_list) should equal the head_dim of the attention layer. | required |
| start | int \| tuple of int \| list of int | If len(args) == 0, start is num; if len(args) == 1, start is start, args[0] is stop, and the step is 1; if len(args) == 2, start is start, args[0] is stop, and args[1] is num. | required |
| *args | | See above. | () |
| theta | float | Scaling factor for frequency computation. Defaults to 10000.0. | 10000.0 |
| theta_rescale_factor | float | Rescale factor for theta. Defaults to 1.0. | 1.0 |
| interpolation_factor | float | Factor to scale positions. Defaults to 1.0. | 1.0 |
| shard_dim | int | Which dimension to shard for sequence parallelism. Defaults to 0. | 0 |
| sp_rank | int | Rank in the sequence parallel group. Defaults to 0. | 0 |
| sp_world_size | int | World size of the sequence parallel group. Defaults to 1. | 1 |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | (cos, sin) tensors of shape [HW, D/2]. |
Source code in fastvideo/layers/rotary_embedding.py
fastvideo.layers.rotary_embedding.get_rotary_pos_embed
¶
get_rotary_pos_embed(rope_sizes, hidden_size, heads_num, rope_dim_list, rope_theta, theta_rescale_factor=1.0, interpolation_factor=1.0, shard_dim: int = 0, dtype: dtype = float32, start_frame: int = 0) -> tuple[Tensor, Tensor]
Generate rotary positional embeddings for the given sizes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rope_sizes | | Tuple of dimensions (t, h, w). | required |
| hidden_size | | Hidden dimension size. | required |
| heads_num | | Number of attention heads. | required |
| rope_dim_list | | List of dimensions for each axis, or None. | required |
| rope_theta | | Base for frequency calculations. | required |
| theta_rescale_factor | | Rescale factor for theta. Defaults to 1.0. | 1.0 |
| interpolation_factor | | Factor to scale positions. Defaults to 1.0. | 1.0 |
| shard_dim | int | Which dimension to shard for sequence parallelism. Defaults to 0. | 0 |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | Tuple of (cos, sin) tensors for rotary embeddings. |
Source code in fastvideo/layers/rotary_embedding.py
fastvideo.layers.utils
¶
Utility methods for model layers.
fastvideo.layers.visual_embedding
¶
Classes¶
fastvideo.layers.visual_embedding.ModulateProjection
¶
ModulateProjection(hidden_size: int, factor: int = 2, act_layer: str = 'silu', dtype: dtype | None = None, prefix: str = '')
Bases: Module
Modulation layer for DiT blocks.
Source code in fastvideo/layers/visual_embedding.py
fastvideo.layers.visual_embedding.PatchEmbed
¶
PatchEmbed(patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, bias=True, dtype=None, prefix: str = '')
Bases: Module
2D Image to Patch Embedding
Image to Patch Embedding using Conv2d
A convolution-based approach to patchifying a 2D image with embedding projection.
Based on the impl in https://github.com/google-research/vision_transformer
Hacked together by / Copyright 2020 Ross Wightman
The _assert call is removed from the forward function for compatibility with multi-resolution images.
Source code in fastvideo/layers/visual_embedding.py
fastvideo.layers.visual_embedding.TimestepEmbedder
¶
TimestepEmbedder(hidden_size, act_layer='silu', frequency_embedding_size=256, max_period=10000, dtype=None, freq_dtype=float32, prefix: str = '')
Bases: Module
Embeds scalar timesteps into vector representations.
Source code in fastvideo/layers/visual_embedding.py
Functions¶
fastvideo.layers.visual_embedding.get_timestep_embedding
¶
get_timestep_embedding(timesteps: Tensor, embedding_dim: int, flip_sin_to_cos: bool = False, downscale_freq_shift: float = 1, scale: float = 1, max_period: int = 10000) -> Tensor
This matches the implementation in Denoising Diffusion Probabilistic Models: Create sinusoidal timestep embeddings.
Args
timesteps (torch.Tensor):
a 1-D Tensor of N indices, one per batch element. These may be fractional.
embedding_dim (int):
the dimension of the output.
flip_sin_to_cos (bool):
Whether the embedding order should be cos, sin (if True) or sin, cos (if False)
downscale_freq_shift (float):
Controls the delta between frequencies between dimensions
scale (float):
Scaling factor applied to the embeddings.
max_period (int):
Controls the maximum frequency of the embeddings
Returns
torch.Tensor: an [N x dim] Tensor of positional embeddings.
Source code in fastvideo/layers/visual_embedding.py
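A condensed sketch of the standard DDPM sinusoidal embedding referenced above, with flip_sin_to_cos, downscale_freq_shift, and scale left at simplified defaults:

```python
import math
import torch

def sinusoidal_timestep_embedding(timesteps: torch.Tensor, dim: int, max_period: int = 10000) -> torch.Tensor:
    # Half the channels carry sin, half carry cos, with geometrically spaced frequencies.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]               # [N, dim/2]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)     # [N, dim]

emb = sinusoidal_timestep_embedding(torch.tensor([0, 10, 500]), 256)
print(emb.shape)  # torch.Size([3, 256])
```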
fastvideo.layers.visual_embedding.timestep_embedding
¶
Create sinusoidal timestep embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| t | Tensor | Tensor of shape [B] with timesteps. | required |
| dim | int | Embedding dimension. | required |
| max_period | int | Controls the minimum frequency of the embeddings. | 10000 |

Returns:

| Type | Description |
|---|---|
| Tensor | Tensor of shape [B, dim] with embeddings. |
Source code in fastvideo/layers/visual_embedding.py
fastvideo.layers.visual_embedding.unpatchify
¶
Convert patched representation back to image space.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | | Tensor of shape [B, T*H*W, C*P_t*P_h*P_w]. | required |
| t, h, w | | Temporal and spatial dimensions. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Unpatchified tensor of shape [B, C, T*P_t, H*P_h, W*P_w]. |
Source code in fastvideo/layers/visual_embedding.py
fastvideo.layers.vocab_parallel_embedding
¶
Classes¶
fastvideo.layers.vocab_parallel_embedding.UnquantizedEmbeddingMethod
¶
Bases: QuantizeMethodBase
Unquantized method for embeddings.
Functions¶
fastvideo.layers.vocab_parallel_embedding.UnquantizedEmbeddingMethod.create_weights
¶create_weights(layer: Module, input_size_per_partition: int, output_partition_sizes: list[int], input_size: int, output_size: int, params_dtype: dtype, **extra_weight_attrs)
Create weights for embedding layer.
Source code in fastvideo/layers/vocab_parallel_embedding.py
fastvideo.layers.vocab_parallel_embedding.VocabParallelEmbedding
¶
VocabParallelEmbedding(num_embeddings: int, embedding_dim: int, params_dtype: dtype | None = None, org_num_embeddings: int | None = None, padding_size: int = DEFAULT_VOCAB_PADDING_SIZE, quant_config: QuantizationConfig | None = None, prefix: str = '')
Bases: Module
Embedding parallelized in the vocabulary dimension.
Adapted from torch.nn.Embedding; note that we pad the vocabulary size to make sure it is divisible by the number of model parallel GPUs.
In order to support various loading methods, we ensure that LoRA-added embeddings are always at the end of TP-sharded tensors. In other words, we shard base embeddings and LoRA embeddings separately (both padded), and place them in the same tensor. In this example, we will have the original vocab size = 1010, added vocab size = 16 and padding to 64. Therefore, the total vocab size with padding will be 1088 (because we first pad 1010 to 1024, add 16, and then pad to 1088). Therefore, the tensor format looks like the following:

TP1, rank 0 (no sharding):
|< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
corresponding token_id: | 0 | 1 | ... | 1009 | -1 | ... | -1 | 1010 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |

TP2, rank 0:
|< --------------------BASE--------------------- >|< -----LORA------ >|< -LORA PADDING- >|
corresponding token_id: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 1000 | ... | 1015 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 527 | 520 | ... | 543 |

TP2, rank 1:
|< -----------BASE----------- >|< -BASE PADDING- >|< -----------LORA PADDING----------- >|
corresponding token_id: | 512 | 513 | 514 | ... | 1009 | -1 | ... | -1 | -1 | ... | -1 | -1 | ... | -1 |
index: | 0 | 1 | 2 | ... | 497 | 498 | ... | 511 | 512 | ... | 519 | 520 | ... | 543 |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| num_embeddings | int | Vocabulary size. | required |
| embedding_dim | int | Size of the hidden state. | required |
| params_dtype | dtype \| None | Type of the parameters. | None |
| org_num_embeddings | int \| None | Original vocabulary size (without LoRA). | None |
| padding_size | int | Padding size for the vocabulary. | DEFAULT_VOCAB_PADDING_SIZE |
| quant_config | QuantizationConfig \| None | Quant config for the layer. | None |
| prefix | str | Full name of the layer in the state dict. | '' |
Source code in fastvideo/layers/vocab_parallel_embedding.py
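A small worked sketch of the padding arithmetic from the docstring above (1010 base entries plus 16 LoRA entries, padded to multiples of 64); pad_to_multiple is a hypothetical helper, not part of the fastvideo API:

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    # Hypothetical helper: round n up to the next multiple of `multiple`.
    return ((n + multiple - 1) // multiple) * multiple

org_vocab_size = 1010
added_vocab_size = 16     # e.g. LoRA-added tokens

padded_base = pad_to_multiple(org_vocab_size)                   # 1024
total_padded = pad_to_multiple(padded_base + added_vocab_size)  # 1040 -> 1088
print(padded_base, total_padded)                                # 1024 1088
```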
Functions¶
fastvideo.layers.vocab_parallel_embedding.VocabParallelEmbedding.get_sharded_to_full_mapping
¶Get a mapping that can be used to reindex the gathered logits for sampling.
During sampling, we gather logits from all ranks. The relationship of index->token_id follows the same format as outlined in the class docstring. However, after the gather we want to reindex the final logits tensor so that index->token_id maps one-to-one (the index is always equal to the token_id it corresponds to). The indices returned by this method allow us to do that.
Source code in fastvideo/layers/vocab_parallel_embedding.py
fastvideo.layers.vocab_parallel_embedding.VocabParallelEmbeddingShardIndices
dataclass
¶
VocabParallelEmbeddingShardIndices(padded_org_vocab_start_index: int, padded_org_vocab_end_index: int, padded_added_vocab_start_index: int, padded_added_vocab_end_index: int, org_vocab_start_index: int, org_vocab_end_index: int, added_vocab_start_index: int, added_vocab_end_index: int)
Indices for a shard of a vocab parallel embedding.