fastvideo.v1.dataset.parquet_dataset_iterable_style

Module Contents

Classes

BatchIterator

LatentsParquetIterStyleDataset

Efficient loader for video-text data from a directory of Parquet files.

Functions

build_parquet_iterable_style_dataloader

Build a dataloader for the LatentsParquetIterStyleDataset.

shard_parquet_files_across_sp_groups_and_workers

Shard parquet files across SP groups and workers in a balanced way.

Data

API

class fastvideo.v1.dataset.parquet_dataset_iterable_style.BatchIterator(files, batch_size, text_padding_length, keys, worker_num_samples, read_batch_size)

Initialization

class fastvideo.v1.dataset.parquet_dataset_iterable_style.LatentsParquetIterStyleDataset(path: str, batch_size: int = 1024, cfg_rate: float = 0.1, num_workers: int = 1, drop_last: bool = True, text_padding_length: int = 512, seed: int = 42, read_batch_size: int = 32, parquet_schema: pyarrow.Schema = None)

Bases: torch.utils.data.IterableDataset

Efficient loader for video-text data from a directory of Parquet files.

Initialization

keys

[('vae_latent', 'latent'), 'text_embedding']

fastvideo.v1.dataset.parquet_dataset_iterable_style.build_parquet_iterable_style_dataloader(path: str, batch_size: int, num_data_workers: int, cfg_rate: float = 0.0, drop_last: bool = True, text_padding_length: int = 512, seed: int = 42, read_batch_size: int = 32) -> tuple[fastvideo.v1.dataset.parquet_dataset_iterable_style.LatentsParquetIterStyleDataset, torchdata.stateful_dataloader.StatefulDataLoader]

Build a dataloader for the LatentsParquetIterStyleDataset.
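A minimal usage sketch, based only on the signature above. The wrapper name `make_latents_dataloader`, the directory argument, and the values chosen for `batch_size` and `num_data_workers` are illustrative assumptions, not recommended settings. The import is done lazily inside the function so the snippet can be defined without FastVideo installed. Any distributed setup the function may expect before it is called is not documented on this page and is not shown here.

```python
def make_latents_dataloader(parquet_dir: str):
    """Hypothetical helper: build the dataset and dataloader for one directory."""
    # Lazy import: requires the fastvideo package only when the helper is called.
    from fastvideo.v1.dataset.parquet_dataset_iterable_style import (
        build_parquet_iterable_style_dataloader,
    )

    # Keyword names follow the documented signature; values are placeholders.
    dataset, dataloader = build_parquet_iterable_style_dataloader(
        path=parquet_dir,
        batch_size=8,            # assumed value for illustration
        num_data_workers=4,      # assumed value for illustration
        cfg_rate=0.1,
        drop_last=True,
        text_padding_length=512,
        seed=42,
        read_batch_size=32,
    )
    # Per the signature, the function returns (dataset, dataloader) in that order.
    return dataset, dataloader
```

The second element of the returned tuple is a `torchdata.stateful_dataloader.StatefulDataLoader`, so it can be iterated like a regular PyTorch dataloader.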

fastvideo.v1.dataset.parquet_dataset_iterable_style.logger

'init_logger(...)'

fastvideo.v1.dataset.parquet_dataset_iterable_style.shard_parquet_files_across_sp_groups_and_workers(path: str, num_sp_groups: int, num_workers: int, seed: int = 42) -> tuple[list[list[str]], list[int], list[dict[str, int]]]

Shard parquet files across sequence-parallel (SP) groups and workers in a balanced way.

Parameters:
  • path – Directory containing parquet files

  • num_sp_groups – Number of SP groups to shard across

  • num_workers – Number of workers per SP group

  • seed – Random seed for shuffling

Returns:

A tuple containing:

  • List of lists of parquet files for each shard

  • List of total samples per shard

  • List of dictionaries mapping file paths to their lengths
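To make the idea of balanced sharding concrete, here is a small self-contained sketch. It is not the library's implementation: the function name `shard_files_balanced` is hypothetical, it takes a precomputed mapping of file path to row count instead of scanning a directory, and it assumes one shard per (SP group, worker) pair and one length dictionary per shard. The strategy shown is a common greedy one: shuffle with a seed, then assign the largest files first to whichever shard currently holds the fewest samples.

```python
import random
from typing import Dict, List, Tuple


def shard_files_balanced(
    file_lengths: Dict[str, int],
    num_sp_groups: int,
    num_workers: int,
    seed: int = 42,
) -> Tuple[List[List[str]], List[int], List[Dict[str, int]]]:
    """Greedy illustration of balanced sharding (assumed strategy)."""
    # Assumption: one shard for every (SP group, worker) combination.
    num_shards = num_sp_groups * num_workers

    # Sort first so the seeded shuffle is deterministic for a given input.
    files = sorted(file_lengths)
    random.Random(seed).shuffle(files)
    # Largest files first; sorted() is stable, so the shuffle only breaks ties.
    files = sorted(files, key=lambda f: file_lengths[f], reverse=True)

    shard_files: List[List[str]] = [[] for _ in range(num_shards)]
    shard_totals: List[int] = [0] * num_shards
    shard_lengths: List[Dict[str, int]] = [{} for _ in range(num_shards)]

    for f in files:
        # Pick the shard with the fewest samples so far (lowest index on ties).
        idx = min(range(num_shards), key=shard_totals.__getitem__)
        shard_files[idx].append(f)
        shard_totals[idx] += file_lengths[f]
        shard_lengths[idx][f] = file_lengths[f]

    return shard_files, shard_totals, shard_lengths


# Example with made-up file names and row counts.
lengths = {"a.parquet": 100, "b.parquet": 80, "c.parquet": 60, "d.parquet": 40}
files_per_shard, totals, lengths_per_shard = shard_files_balanced(
    lengths, num_sp_groups=2, num_workers=1
)
```

With these made-up inputs the two shards end up with 140 samples each. The three returned values mirror the shape of the documented return tuple.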