fastvideo.workflow.preprocess.components

Module Contents

Classes

ParquetDatasetSaver

Component for saving and writing Parquet datasets using shared parquet_io.

PreprocessingDataValidator

VideoForwardBatchBuilder

Functions

build_dataset

Data

logger

API

class fastvideo.workflow.preprocess.components.ParquetDatasetSaver(flush_frequency: int, samples_per_file: int, schema: pyarrow.Schema, record_creator: collections.abc.Callable[..., list[dict[str, Any]]])

Component for saving and writing Parquet datasets using shared parquet_io.

Initialization

clean_up() → None

Clean up all tables.

flush_tables(write_remainder: bool = False)

Flush buffered records to disk.

Parameters:
  • output_dir – Directory where parquet files are written. Kept for API symmetry (writer already configured with this path).

  • write_remainder – If True, also write any leftover rows smaller than samples_per_file as a final small file. Useful for the last flush.
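
The chunking behavior described for write_remainder can be sketched in plain Python. This is a minimal illustration, not the FastVideo implementation: BufferedChunkWriter, its add and flush methods, and the in-memory files list (standing in for Parquet files on disk) are all hypothetical names introduced here.

```python
from typing import Any


class BufferedChunkWriter:
    """Hypothetical stand-in for a buffered writer; not the FastVideo class."""

    def __init__(self, samples_per_file: int) -> None:
        if samples_per_file <= 0:
            raise ValueError("samples_per_file must be positive")
        self.samples_per_file = samples_per_file
        self.buffer: list[dict[str, Any]] = []
        # Each inner list stands in for one file written to disk.
        self.files: list[list[dict[str, Any]]] = []

    def add(self, records: list[dict[str, Any]]) -> None:
        """Append records to the in-memory buffer."""
        self.buffer.extend(records)

    def flush(self, write_remainder: bool = False) -> int:
        """Emit full chunks; optionally emit the short tail. Returns files written."""
        written = 0
        # Write as many complete files as the buffer allows.
        while len(self.buffer) >= self.samples_per_file:
            chunk = self.buffer[:self.samples_per_file]
            self.buffer = self.buffer[self.samples_per_file:]
            self.files.append(chunk)
            written += 1
        # On the final flush, write the leftover rows as one smaller file.
        if write_remainder and self.buffer:
            self.files.append(self.buffer)
            self.buffer = []
            written += 1
        return written


writer = BufferedChunkWriter(samples_per_file=4)
writer.add([{"id": i} for i in range(10)])
full_files = writer.flush()                      # two files of 4 rows, 2 rows stay buffered
tail_files = writer.flush(write_remainder=True)  # the 2 leftover rows become a final small file
```

Under these assumptions, a regular flush never produces a file smaller than samples_per_file, so passing write_remainder=True only matters for the last call.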

save_and_write_parquet_batch(batch: fastvideo.pipelines.pipeline_batch_info.PreprocessBatch, output_dir: str, extra_features: dict[str, Any] | None = None) → None

Save and write a Parquet dataset batch.

Parameters:
  • batch – PreprocessBatch containing video and metadata information

  • output_dir – Output directory

  • extra_features – Extra features

Returns:

Number of processed samples

class fastvideo.workflow.preprocess.components.PreprocessingDataValidator(max_height: int = 1024, max_width: int = 1024, max_h_div_w_ratio: float = 17 / 16, min_h_div_w_ratio: float = 8 / 16, num_frames: int = 16, train_fps: int = 24, speed_factor: float = 1.0, video_length_tolerance_range: float = 5.0, drop_short_ratio: float = 0.0, hw_aspect_threshold: float = 1.5)

Initialization

add_validator(name: str, validator: collections.abc.Callable[[dict[str, Any]], bool]) → None

log_validation_stats()

register_validators() → None
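
The add_validator signature suggests a registry of named callables, each taking a sample dict and returning a bool. The sketch below shows one way such a registry could work. It is an assumption-heavy illustration, not the FastVideo implementation: ValidatorRegistry, the "height" and "width" sample keys, the short-circuit behavior, and the rejection counters are all hypothetical. The ratio and resolution bounds reuse the constructor defaults listed above, and treating them as hard filters is likewise an assumption.

```python
from collections.abc import Callable
from typing import Any

Validator = Callable[[dict[str, Any]], bool]


class ValidatorRegistry:
    """Hypothetical stand-in for a named-validator registry; not the FastVideo class."""

    def __init__(self) -> None:
        self.validators: dict[str, Validator] = {}
        self.rejections: dict[str, int] = {}
        self.total = 0
        self.passed = 0

    def add_validator(self, name: str, validator: Validator) -> None:
        """Register a named check and start its rejection counter at zero."""
        self.validators[name] = validator
        self.rejections[name] = 0

    def __call__(self, sample: dict[str, Any]) -> bool:
        """Run checks in registration order; stop at the first failure."""
        self.total += 1
        for name, validator in self.validators.items():
            if not validator(sample):
                self.rejections[name] += 1
                return False
        self.passed += 1
        return True


def aspect_ratio_ok(sample: dict[str, Any]) -> bool:
    # Assumed meaning of min_h_div_w_ratio / max_h_div_w_ratio (defaults 8/16 and 17/16).
    ratio = sample["height"] / sample["width"]
    return 8 / 16 <= ratio <= 17 / 16


def resolution_ok(sample: dict[str, Any]) -> bool:
    # Assumed meaning of max_height / max_width (defaults 1024 and 1024).
    return sample["height"] <= 1024 and sample["width"] <= 1024


registry = ValidatorRegistry()
registry.add_validator("aspect_ratio", aspect_ratio_ok)
registry.add_validator("resolution", resolution_ok)

samples = [
    {"height": 480, "width": 848},    # passes both checks
    {"height": 1080, "width": 1920},  # ratio is fine, but exceeds the size bound
    {"height": 720, "width": 480},    # ratio 1.5 exceeds 17/16
]
kept = [s for s in samples if registry(s)]
```

Because the registry object is itself callable with the shape Callable[[dict[str, Any]], bool], an object like this could be passed wherever that callable type is expected, and per-name counters give the kind of summary a method such as log_validation_stats might report.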

class fastvideo.workflow.preprocess.components.VideoForwardBatchBuilder(seed: int)

Initialization

fastvideo.workflow.preprocess.components.build_dataset(preprocess_config: fastvideo.configs.configs.PreprocessConfig, split: str, validator: collections.abc.Callable[[dict[str, Any]], bool]) → datasets.Dataset

fastvideo.workflow.preprocess.components.logger

‘init_logger(…)’