fastvideo.v1.layers.vocab_parallel_embedding#

Module Contents#

Classes#

UnquantizedEmbeddingMethod

Unquantized method for embeddings.

VocabParallelEmbedding

Embedding parallelized in the vocabulary dimension.

VocabParallelEmbeddingShardIndices

Indices for a shard of a vocab parallel embedding.

Functions#

get_masked_input_and_mask

pad_vocab_size

Pad the vocab size so that it is divisible by the given value.

vocab_range_from_global_vocab_size

vocab_range_from_per_partition_vocab_size

Data#

DEFAULT_VOCAB_PADDING_SIZE

API#

fastvideo.v1.layers.vocab_parallel_embedding.DEFAULT_VOCAB_PADDING_SIZE[source]#

64

class fastvideo.v1.layers.vocab_parallel_embedding.UnquantizedEmbeddingMethod[source]#

Bases: fastvideo.v1.layers.quantization.base_config.QuantizeMethodBase

Unquantized method for embeddings.

apply(layer: torch.nn.Module, x: torch.Tensor, bias: Optional[torch.Tensor] = None) → torch.Tensor[source]#
create_weights(layer: torch.nn.Module, input_size_per_partition: int, output_partition_sizes: List[int], input_size: int, output_size: int, params_dtype: torch.dtype, **extra_weight_attrs)[source]#

Create weights for embedding layer.

embedding(layer: torch.nn.Module, input_: torch.Tensor) → torch.Tensor[source]#
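
For orientation, here is a minimal, self-contained sketch of what an unquantized embedding method typically amounts to: a plain floating-point weight of shape (num_embeddings, embedding_dim), an embedding() that is a table lookup, and an apply() that is a linear projection against the same weight (as with a tied LM head). This illustrates the pattern only; the shapes and initialization are arbitrary and the code is not taken from the fastvideo source.

import torch
import torch.nn.functional as F
from torch import nn

# Plain (unquantized) embedding weight: one row per token id.
weight = nn.Parameter(torch.empty(1024, 512))
nn.init.normal_(weight, std=0.02)

token_ids = torch.tensor([[1, 5, 42]])
hidden = F.embedding(token_ids, weight)   # what embedding() boils down to
logits = F.linear(hidden, weight)         # what apply() boils down to (tied projection)
print(hidden.shape, logits.shape)         # torch.Size([1, 3, 512]) torch.Size([1, 3, 1024])
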
class fastvideo.v1.layers.vocab_parallel_embedding.VocabParallelEmbedding(num_embeddings: int, embedding_dim: int, params_dtype: Optional[torch.dtype] = None, org_num_embeddings: Optional[int] = None, padding_size: int = DEFAULT_VOCAB_PADDING_SIZE, quant_config: Optional[fastvideo.v1.layers.quantization.base_config.QuantizationConfig] = None, prefix: str = '')[source]#

Bases: torch.nn.Module

Embedding parallelized in the vocabulary dimension.

Adapted from torch.nn.Embedding. Note that we pad the vocabulary size to make sure it is divisible by the number of model-parallel GPUs.

In order to support various loading methods, we ensure that LoRA-added embeddings are always at the end of TP-sharded tensors. In other words, we shard base embeddings and LoRA embeddings separately (both padded) and place them in the same tensor. In this example, the original vocab size is 1010, the added vocab size is 16, and we pad to a multiple of 64. The total vocab size with padding is therefore 1088 (we first pad 1010 to 1024, add 16, and then pad to 1088). The tensor layout looks like the following:

TP1, rank 0 (no sharding):
                        |< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
corresponding token_id: |  0  |  1  | ... | 1009 |  -1  | ... |  -1  | 1010 | ... | 1025 |  -1  | ... |  -1  |
                 index: |  0  |  1  | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |

TP2, rank 0:
                        |< -------------------BASE-------------------- >|< -----LORA------ >|< -LORA PADDING- >|
corresponding token_id: |  0  |  1  |  2  | ... | 497 | 498 | ... | 511 | 1010 | ... | 1025 |  -1  | ... |  -1  |
                 index: |  0  |  1  |  2  | ... | 497 | 498 | ... | 511 | 512  | ... | 527  | 528  | ... | 543  |

TP2, rank 1:
                        |< ----------BASE----------- >|< -BASE PADDING- >|< ----------LORA PADDING----------- >|
corresponding token_id: | 512 | 513 | 514 | ... | 1009 | -1  | ... | -1  |  -1  | ... |  -1  |  -1  | ... |  -1  |
                 index: |  0  |  1  |  2  | ... | 497  | 498 | ... | 511 | 512  | ... | 519  | 520  | ... | 543  |
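
As a quick check of the arithmetic in the example, the sketch below reproduces the padded sizes, assuming "padding" means rounding up to the next multiple of padding_size (the round_up helper is illustrative and not part of the module's API):

def round_up(n: int, multiple: int = 64) -> int:
    # Smallest multiple of `multiple` that is >= n.
    return ((n + multiple - 1) // multiple) * multiple

org_vocab, added_vocab, tp = 1010, 16, 2
padded_base = round_up(org_vocab)                    # 1024
padded_total = round_up(padded_base + added_vocab)   # 1088
per_rank = padded_total // tp                        # 544 indices per TP rank
print(padded_base, padded_total, per_rank)           # 1024 1088 544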

Parameters:
  • num_embeddings – vocabulary size.

  • embedding_dim – size of hidden state.

  • params_dtype – type of the parameters.

  • org_num_embeddings – original vocabulary size (without LoRA).

  • padding_size – padding size for the vocabulary.

  • quant_config – quantization config for the layer.

  • prefix – full name of the layer in the state dict.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.
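
To make the parallelization concrete, the following is a single-process sketch of the lookup pattern described in the class docstring: each rank keeps only its slice of the (padded) vocabulary, masks token ids owned by other ranks, zeroes the corresponding rows after the local lookup, and the per-rank partial results are summed, which is what the tensor-parallel all-reduce achieves. This is an assumption-laden illustration (a loop stands in for the TP ranks), not the module's actual forward().

import torch
import torch.nn.functional as F

vocab, dim, tp = 8, 4, 2
full_weight = torch.randn(vocab, dim)
shards = full_weight.chunk(tp, dim=0)            # each simulated rank holds vocab // tp rows

token_ids = torch.tensor([0, 3, 5, 7])
output = torch.zeros(len(token_ids), dim)
for rank, shard in enumerate(shards):
    start = rank * (vocab // tp)
    end = start + shard.shape[0]
    mask = (token_ids >= start) & (token_ids < end)
    local_ids = (token_ids - start).clamp(0, shard.shape[0] - 1)
    partial = F.embedding(local_ids, shard)
    partial[~mask] = 0.0                         # rows owned by other ranks contribute nothing
    output += partial                            # stands in for the all-reduce across ranks

assert torch.allclose(output, F.embedding(token_ids, full_weight))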

extra_repr() → str[source]#
forward(input_)[source]#
get_sharded_to_full_mapping() → Optional[List[int]][source]#

Get a mapping that can be used to reindex the gathered logits for sampling.

During sampling, we gather logits from all ranks. The relationship of index->token_id will follow the same format as outlined in the class docstring. However, after the gather, we want to reindex the final logits tensor so that index->token_id maps one-to-one (each index is always equal to the token_id it corresponds to). The indices returned by this method allow us to do that.
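
The returned list is presumably used as a column index into the gathered logits so that position i afterwards corresponds to token id i. The toy below demonstrates only that reindexing idea with made-up layout values; it does not reproduce the class's actual mapping.

import torch

# Pretend TP=2 with 3 real tokens plus 1 pad column per rank; the gathered layout is
# [rank0: tok0, tok1, tok2, pad | rank1: tok3, tok4, tok5, pad].
gathered_logits = torch.arange(2 * 8, dtype=torch.float32).reshape(2, 8)
sharded_to_full = [0, 1, 2, 4, 5, 6, 3, 7]       # real-token columns first, pad columns last

full_logits = gathered_logits[:, sharded_to_full]
# Columns 0..5 of full_logits now line up with token ids 0..5; the pad columns trail.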

weight_loader(param: torch.nn.parameter.Parameter, loaded_weight: torch.Tensor)[source]#
class fastvideo.v1.layers.vocab_parallel_embedding.VocabParallelEmbeddingShardIndices[source]#

Indices for a shard of a vocab parallel embedding.

added_vocab_end_index: int[source]#

None

added_vocab_start_index: int[source]#

None

property num_added_elements: int[source]#
property num_added_elements_padded: int[source]#
property num_added_vocab_padding: int[source]#
property num_elements_padded: int[source]#
property num_org_elements: int[source]#
property num_org_elements_padded: int[source]#
property num_org_vocab_padding: int[source]#
org_vocab_end_index: int[source]#

None

org_vocab_start_index: int[source]#

None

padded_added_vocab_end_index: int[source]#

None

padded_added_vocab_start_index: int[source]#

None

padded_org_vocab_end_index: int[source]#

None

padded_org_vocab_start_index: int[source]#

None
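
The start/end fields above pin down each shard's slice of the base and added vocab, and the num_* properties are presumably simple differences of those indices. Below is a minimal sketch under that assumption, restricted to the base-vocab fields; the class name is hypothetical, the field and property names mirror the class, and the arithmetic is inferred rather than copied.

from dataclasses import dataclass

@dataclass
class ShardIndicesSketch:
    padded_org_vocab_start_index: int
    padded_org_vocab_end_index: int
    org_vocab_start_index: int
    org_vocab_end_index: int

    @property
    def num_org_elements(self) -> int:
        return self.org_vocab_end_index - self.org_vocab_start_index

    @property
    def num_org_elements_padded(self) -> int:
        return self.padded_org_vocab_end_index - self.padded_org_vocab_start_index

    @property
    def num_org_vocab_padding(self) -> int:
        return self.num_org_elements_padded - self.num_org_elements

# TP2, rank 1 from the VocabParallelEmbedding example: base tokens 512..1009, padded slice of 512 slots.
s = ShardIndicesSketch(512, 1024, 512, 1010)
print(s.num_org_elements, s.num_org_elements_padded, s.num_org_vocab_padding)  # 498 512 14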

fastvideo.v1.layers.vocab_parallel_embedding.get_masked_input_and_mask(input_: torch.Tensor, org_vocab_start_index: int, org_vocab_end_index: int, num_org_vocab_padding: int, added_vocab_start_index: int, added_vocab_end_index: int) → Tuple[torch.Tensor, torch.Tensor][source]#
fastvideo.v1.layers.vocab_parallel_embedding.pad_vocab_size(vocab_size: int, pad_to: int = DEFAULT_VOCAB_PADDING_SIZE) → int[source]#

Pad the vocab size so that it is divisible by the given value.

fastvideo.v1.layers.vocab_parallel_embedding.vocab_range_from_global_vocab_size(global_vocab_size: int, rank: int, world_size: int, offset: int = 0) → Sequence[int][source]#
fastvideo.v1.layers.vocab_parallel_embedding.vocab_range_from_per_partition_vocab_size(per_partition_vocab_size: int, rank: int, offset: int = 0) → Sequence[int][source]#
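
Finally, a hedged usage sketch of the module-level helpers, reusing the numbers from the VocabParallelEmbedding docstring (a 1010-token base vocab, padding to a multiple of 64, TP=2). The values in the comments are what those docstrings lead one to expect, not verified output.

from fastvideo.v1.layers.vocab_parallel_embedding import (
    pad_vocab_size,
    vocab_range_from_global_vocab_size,
)

padded = pad_vocab_size(1010)                                             # expected: 1024
rank0 = vocab_range_from_global_vocab_size(padded, rank=0, world_size=2)
rank1 = vocab_range_from_global_vocab_size(padded, rank=1, world_size=2)
print(padded, rank0, rank1)                                               # expected: 1024, (0, 512), (512, 1024)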