fastvideo.v1.layers.rotary_embedding#

Rotary Positional Embeddings.

Module Contents#

Classes#

RotaryEmbedding

Original rotary positional embedding.

Functions#

get_1d_rotary_pos_embed

Precompute the frequency tensor for complex exponentials (cis) with the given dimensions. (Note: cis means cos + i * sin, where i is the imaginary unit.)

get_meshgrid_nd

Get n-D meshgrid with start, stop and num.

get_nd_rotary_pos_embed

An n-d version of precompute_freqs_cis: RoPE for tokens with n-d structure. Supports sequence parallelism by allowing a specific dimension to be sharded.

get_rope

get_rotary_pos_embed

Generate rotary positional embeddings for the given sizes.

API#

class fastvideo.v1.layers.rotary_embedding.RotaryEmbedding(head_size: int, rotary_dim: int, max_position_embeddings: int, base: Union[int, float], is_neox_style: bool, dtype: torch.dtype)[source]#

Bases: fastvideo.v1.layers.custom_op.CustomOp

Original rotary positional embedding.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

extra_repr() → str[source]#
forward_native(positions: torch.Tensor, query: torch.Tensor, key: torch.Tensor, offsets: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]#

A PyTorch-native implementation of forward().
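For illustration, a minimal usage sketch based on the constructor and forward_native signatures above; the flattened [num_tokens, num_heads * head_size] layout for query and key is an assumption, not a documented guarantee:

import torch
from fastvideo.v1.layers.rotary_embedding import RotaryEmbedding

head_size = 64
rope = RotaryEmbedding(
    head_size=head_size,
    rotary_dim=head_size,            # rotate the full head dimension
    max_position_embeddings=4096,
    base=10000,
    is_neox_style=True,
    dtype=torch.float32,
)

num_tokens, num_heads = 16, 8
positions = torch.arange(num_tokens)
# Assumed layout: [num_tokens, num_heads * head_size]
query = torch.randn(num_tokens, num_heads * head_size)
key = torch.randn(num_tokens, num_heads * head_size)

# Apply the rotation to query and key in one call.
query, key = rope.forward_native(positions, query, key)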

fastvideo.v1.layers.rotary_embedding.get_1d_rotary_pos_embed(dim: int, pos: Union[torch.FloatTensor, int], theta: float = 10000.0, theta_rescale_factor: float = 1.0, interpolation_factor: float = 1.0, dtype: torch.dtype = torch.float32) → Tuple[torch.Tensor, torch.Tensor][source]#

Precompute the frequency tensor for complex exponentials (cis) with the given dimensions. (Note: cis means cos + i * sin, where i is the imaginary unit.)

This function calculates a frequency tensor of complex exponentials for the given dimension 'dim' and the position indices 'pos'. The 'theta' parameter scales the frequencies.

Parameters:
  • dim (int) – Dimension of the frequency tensor.

  • pos (int or torch.FloatTensor) – Position indices for the frequency tensor. [S] or scalar

  • theta (float, optional) – Scaling factor for frequency computation. Defaults to 10000.0.

  • theta_rescale_factor (float, optional) – Rescale factor for theta. Defaults to 1.0.

  • interpolation_factor (float, optional) – Factor to scale positions. Defaults to 1.0.

Returns:

freqs_cos, freqs_sin – precomputed cosine and sine frequency tensors, each of shape [S, D].

Return type:

Tuple[torch.Tensor, torch.Tensor]
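For example, a sketch based on the documented signature and return shape:

from fastvideo.v1.layers.rotary_embedding import get_1d_rotary_pos_embed

# 32 positions over a 64-dim frequency tensor; per the docs each
# returned tensor has shape [S, D] = [32, 64].
freqs_cos, freqs_sin = get_1d_rotary_pos_embed(dim=64, pos=32, theta=10000.0)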

fastvideo.v1.layers.rotary_embedding.get_meshgrid_nd(start: Union[int, Tuple[int, ...]], *args: Union[int, Tuple[int, ...]], dim: int = 2) → torch.Tensor[source]#

Get n-D meshgrid with start, stop and num.

Parameters:
  • start (int or tuple) – Interpreted together with *args: if len(args) == 0, start is num; if len(args) == 1, start is start, args[0] is stop, and the step is 1; if len(args) == 2, start is start, args[0] is stop, and args[1] is num. For n dimensions, start/stop/num may each be an int or an n-tuple; when n-tuples are given, the meshgrid is stacked following the dimension order of the tuples.

  • *args – See above.

  • dim (int) – Dimension of the meshgrid. Defaults to 2.

Returns:

[dim, …]

Return type:

grid (torch.Tensor)
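The start/stop/num overloading is easiest to see by example (a sketch based on the documented signature; the shape comments follow the [dim, …] return description):

from fastvideo.v1.layers.rotary_embedding import get_meshgrid_nd

# One positional argument: start is interpreted as num (grid size per axis).
# A 3-D (t, h, w) grid of size (4, 8, 8) -> shape [3, 4, 8, 8].
grid = get_meshgrid_nd((4, 8, 8), dim=3)

# Three positional arguments: start, stop, num per axis -> shape [2, 8, 8].
grid2 = get_meshgrid_nd((0, 0), (4, 4), (8, 8), dim=2)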

fastvideo.v1.layers.rotary_embedding.get_nd_rotary_pos_embed(rope_dim_list, start, *args, theta=10000.0, theta_rescale_factor: Union[float, List[float]] = 1.0, interpolation_factor: Union[float, List[float]] = 1.0, shard_dim: int = 0, sp_rank: int = 0, sp_world_size: int = 1, dtype: torch.dtype = torch.float32) → Tuple[torch.Tensor, torch.Tensor][source]#

An n-d version of precompute_freqs_cis: RoPE for tokens with n-d structure. Supports sequence parallelism by allowing a specific dimension to be sharded.

Parameters:
  • rope_dim_list (list of int) – Dimension of each rope axis. len(rope_dim_list) should equal n, and sum(rope_dim_list) should equal the head_dim of the attention layer.

  • start (int | tuple of int | list of int) – Interpreted together with *args: if len(args) == 0, start is num; if len(args) == 1, start is start, args[0] is stop, and the step is 1; if len(args) == 2, start is start, args[0] is stop, and args[1] is num.

  • *args – See above.

  • theta (float) – Scaling factor for frequency computation. Defaults to 10000.0.

  • theta_rescale_factor (float) – Rescale factor for theta. Defaults to 1.0.

  • interpolation_factor (float) – Factor to scale positions. Defaults to 1.0.

  • shard_dim (int) – Which dimension to shard for sequence parallelism. Defaults to 0.

  • sp_rank (int) – Rank in the sequence parallel group. Defaults to 0.

  • sp_world_size (int) – World size of the sequence parallel group. Defaults to 1.

Returns:

(cos, sin) tensors of shape [HW, D/2]

Return type:

Tuple[torch.Tensor, torch.Tensor]
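A sketch of a 3-D (t, h, w) call based on the signature above; the rope_dim_list split is illustrative and only needs to sum to the attention head_dim:

from fastvideo.v1.layers.rotary_embedding import get_nd_rotary_pos_embed

# head_dim = 16 + 24 + 24 = 64, split across the (t, h, w) axes.
cos, sin = get_nd_rotary_pos_embed(
    rope_dim_list=[16, 24, 24],
    start=(4, 8, 8),      # single positional argument: grid size per axis
    theta=10000.0,
    shard_dim=0,          # shard the temporal axis under sequence parallelism
    sp_rank=0,
    sp_world_size=1,
)
# Per the documented return shape, cos and sin should each be
# [prod(grid sizes), head_dim / 2] = [256, 32] here (an inference
# from the [HW, D/2] description for the 3-D case).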

fastvideo.v1.layers.rotary_embedding.get_rope(head_size: int, rotary_dim: int, max_position: int, base: Union[int, float], is_neox_style: bool = True, rope_scaling: Optional[Dict[str, Any]] = None, dtype: Optional[torch.dtype] = None, partial_rotary_factor: float = 1.0) → fastvideo.v1.layers.rotary_embedding.RotaryEmbedding[source]#
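Judging from its signature and return annotation, get_rope constructs a RotaryEmbedding; rope_scaling and partial_rotary_factor presumably select scaled or partial-rotary variants, though this is not documented. A minimal sketch:

import torch
from fastvideo.v1.layers.rotary_embedding import get_rope

rope = get_rope(
    head_size=64,
    rotary_dim=64,        # full rotary dimension (no partial rotation)
    max_position=4096,
    base=10000,
    is_neox_style=True,
    dtype=torch.float32,
)
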
fastvideo.v1.layers.rotary_embedding.get_rotary_pos_embed(rope_sizes, hidden_size, heads_num, rope_dim_list, rope_theta, theta_rescale_factor=1.0, interpolation_factor=1.0, shard_dim: int = 0, dtype: torch.dtype = torch.float32) → Tuple[torch.Tensor, torch.Tensor][source]#

Generate rotary positional embeddings for the given sizes.

Parameters:
  • rope_sizes – Tuple of dimensions (t, h, w)

  • hidden_size – Hidden dimension size

  • heads_num – Number of attention heads

  • rope_dim_list – List of dimensions for each axis, or None

  • rope_theta – Base for frequency calculations

  • theta_rescale_factor – Rescale factor for theta. Defaults to 1.0

  • interpolation_factor – Factor to scale positions. Defaults to 1.0

  • shard_dim – Which dimension to shard for sequence parallelism. Defaults to 0.

Returns:

Tuple of (cos, sin) tensors for rotary embeddings
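A sketch for a video latent based on the parameter list above; the sizes are illustrative, and the claim that the per-axis split is chosen internally when rope_dim_list is None is an assumption drawn from the "or None" wording:

from fastvideo.v1.layers.rotary_embedding import get_rotary_pos_embed

# 8 frames of 16x16 spatial tokens; head_dim = 1024 // 16 = 64.
cos, sin = get_rotary_pos_embed(
    rope_sizes=(8, 16, 16),
    hidden_size=1024,
    heads_num=16,
    rope_dim_list=None,   # assumed: let the function pick the (t, h, w) split
    rope_theta=10000,
)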