# 🧱 Data Preprocessing
To save GPU memory during training, FastVideo precomputes text embeddings and VAE latents, so the text encoder and VAE never need to be loaded at training time.
## Quick Start
Download the sample dataset and run preprocessing:
```bash
# Download the crush-smol dataset
python scripts/huggingface/download_hf.py \
  --repo_id "wlsaidhi/crush-smol-merged" \
  --local_dir "data/crush-smol" \
  --repo_type "dataset"

# Run preprocessing
bash examples/training/finetune/wan_t2v_1.3B/crush_smol/preprocess_wan_data_t2v_new.sh
```
## Preprocessing Pipeline
The new preprocessing pipeline supports multiple dataset formats and video loaders:
```bash
GPU_NUM=2
MODEL_PATH="Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
DATASET_PATH="data/crush-smol/"
OUTPUT_DIR="data/crush-smol_processed_t2v/"

torchrun --nproc_per_node=$GPU_NUM \
    -m fastvideo.pipelines.preprocess.v1_preprocessing_new \
    --model_path $MODEL_PATH \
    --mode preprocess \
    --workload_type t2v \
    --preprocess.video_loader_type torchvision \
    --preprocess.dataset_type merged \
    --preprocess.dataset_path $DATASET_PATH \
    --preprocess.dataset_output_dir $OUTPUT_DIR \
    --preprocess.preprocess_video_batch_size 2 \
    --preprocess.dataloader_num_workers 0 \
    --preprocess.max_height 480 \
    --preprocess.max_width 832 \
    --preprocess.num_frames 77 \
    --preprocess.train_fps 16 \
    --preprocess.samples_per_file 8 \
    --preprocess.flush_frequency 8 \
    --preprocess.video_length_tolerance_range 5
```
## Key Parameters
| Parameter | Description |
|---|---|
| `--workload_type` | Task type: `t2v` (text-to-video) or `i2v` (image-to-video) |
| `--preprocess.dataset_type` | Input format: `hf` (HuggingFace) or `merged` (local folder) |
| `--preprocess.dataset_path` | Path to dataset (HF repo ID or local folder) |
| `--preprocess.dataset_output_dir` | Output directory for Parquet files |
| `--preprocess.video_loader_type` | Video decoder: `torchcodec` or `torchvision` |
| `--preprocess.max_height` / `--preprocess.max_width` | Target resolution for videos |
| `--preprocess.num_frames` | Number of frames to extract per video |
| `--preprocess.train_fps` | Target FPS for frame extraction |
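As a quick sanity check on these settings: with the example values above, each extracted clip spans `num_frames / train_fps = 77 / 16 ≈ 4.8` seconds of source video.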
## Dataset Formats
### Merged Dataset (Local Folder)
Structure your dataset as follows:
```
your_dataset/
├── videos/
│   ├── video_001.mp4
│   ├── video_002.mp4
│   └── ...
└── videos2caption.json
```
The `videos2caption.json` file maps video filenames to captions:
```json
[
  {"path": "video_001.mp4", "cap": "A cat playing with yarn..."},
  {"path": "video_002.mp4", "cap": "Ocean waves at sunset..."}
]
```
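Before preprocessing, it can be worth sanity-checking the mapping. A minimal sketch, not part of FastVideo (the dataset directory name is hypothetical):

```python
# Verify every entry in videos2caption.json points at a real video file.
import json
from pathlib import Path

dataset_root = Path("your_dataset")  # hypothetical path
entries = json.loads((dataset_root / "videos2caption.json").read_text())

for entry in entries:
    video = dataset_root / "videos" / entry["path"]
    assert video.exists(), f"missing video: {video}"
    assert entry["cap"].strip(), f"empty caption for {entry['path']}"

print(f"OK: {len(entries)} video/caption pairs")
```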
### HuggingFace Dataset
Use `--preprocess.dataset_type hf` and point `--preprocess.dataset_path` to a HuggingFace dataset with video and caption columns.
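To confirm a dataset exposes the expected columns before preprocessing, a quick check with the `datasets` library can help (the repo ID below is a placeholder):

```python
# Inspect the columns of a HuggingFace dataset before preprocessing.
from datasets import load_dataset

ds = load_dataset("your-username/your-video-dataset", split="train")
print(ds.column_names)  # expect a video column and a caption column
print(ds[0])            # peek at one sample
```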
## Creating Your Own Dataset
If you have raw videos and captions in separate files, generate the `videos2caption.json`:
```bash
python scripts/dataset_preparation/prepare_json_file.py \
  --data_folder path/to/your_raw_data/ \
  --output path/to/output_folder
```
Your raw data folder should contain:
```
your_raw_data/
├── videos/
│   ├── 0.mp4
│   ├── 1.mp4
│   └── ...
├── videos.txt    # list of video filenames
└── prompt.txt    # corresponding captions (one per line)
```
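Equivalently, the mapping can be assembled by hand. A sketch, assuming `videos.txt` and `prompt.txt` line up one-to-one (the folder name is hypothetical):

```python
# Build videos2caption.json from videos.txt and prompt.txt.
import json
from pathlib import Path

root = Path("your_raw_data")  # hypothetical path
videos = [l for l in root.joinpath("videos.txt").read_text().splitlines() if l.strip()]
captions = [l for l in root.joinpath("prompt.txt").read_text().splitlines() if l.strip()]
assert len(videos) == len(captions), "videos.txt and prompt.txt must align"

entries = [{"path": v, "cap": c} for v, c in zip(videos, captions)]
root.joinpath("videos2caption.json").write_text(json.dumps(entries, indent=2))
```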
## Output Format
Preprocessing outputs Parquet files in the `combined_parquet_dataset/` subdirectory containing:
- `vae_latent_bytes`: VAE-encoded video latent
- `text_embedding_bytes`: text encoder output
- `clip_feature_bytes`: CLIP image features (I2V only)
- `first_frame_latent_bytes`: first frame latent (I2V only)
- Metadata: shapes, dtypes, and sample identifiers
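The records can be inspected with any Parquet reader. A minimal sketch using pyarrow; the exact file layout, metadata column names, and dtypes are assumptions, so check the printed keys first:

```python
# Open one output Parquet file and rebuild a latent from raw bytes.
import glob

import numpy as np
import pyarrow.parquet as pq

files = glob.glob(
    "data/crush-smol_processed_t2v/combined_parquet_dataset/**/*.parquet",
    recursive=True)
row = pq.read_table(files[0]).slice(0, 1).to_pylist()[0]
print(row.keys())  # lists the bytes and metadata columns actually written

# float32 here is an assumption; use the dtype recorded in the metadata.
latent = np.frombuffer(row["vae_latent_bytes"], dtype=np.float32)
print(latent.size)
```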
## Examples
See ready-to-run preprocessing scripts in the training examples:
- T2V: `examples/training/finetune/wan_t2v_1.3B/crush_smol/preprocess_wan_data_t2v_new.sh`
- I2V: `examples/training/finetune/wan_i2v_14B_480p/crush_smol/preprocess_wan_data_i2v_new.sh`