🧱 Data Preprocess for Distillation

For distillation, we use the same data preprocessing pipeline as for training. Refer to the Training Data Preprocess guide for the general preprocessing steps.

Distillation-Specific Datasets

FastVideo 480P Synthetic Wan Dataset

For Wan2.1 T2V distillation, we use the FastVideo 480P Synthetic Wan dataset (FastVideo/Wan-Syn_77x448x832_600k), which contains 600k synthetic latents.

# Download the preprocessed dataset
python scripts/huggingface/download_hf.py \
    --repo_id "FastVideo/Wan-Syn_77x448x832_600k" \
    --local_dir "FastVideo/Wan-Syn_77x448x832_600k" \
    --repo_type "dataset"
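
If you prefer to fetch the dataset from Python, the following is a minimal sketch that uses `snapshot_download` from the `huggingface_hub` package. It assumes `huggingface_hub` is installed; the helper names `download_kwargs` and `download_dataset` are hypothetical, and this is not necessarily what `scripts/huggingface/download_hf.py` does internally.

```python
from typing import Dict


def download_kwargs(repo_id: str, local_dir: str, repo_type: str = "dataset") -> Dict[str, str]:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    The argument names mirror the CLI flags shown above (hypothetical helper).
    """
    return {"repo_id": repo_id, "repo_type": repo_type, "local_dir": local_dir}


def download_dataset(repo_id: str, local_dir: str) -> str:
    """Download a dataset repository and return the local folder path."""
    # Imported lazily so download_kwargs can be used without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(repo_id, local_dir))
```

Calling `download_dataset("FastVideo/Wan-Syn_77x448x832_600k", "FastVideo/Wan-Syn_77x448x832_600k")` would then mirror the command above.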

Crush Smol Dataset

For Wan2.2 TI2V distillation, we use the crush_smol dataset, which includes both raw videos and preprocessed latents. The command below downloads it from the FastVideo/mini_i2v_dataset repository.

# Download dataset
python scripts/huggingface/download_hf.py \
    --repo_id=FastVideo/mini_i2v_dataset \
    --local_dir=data/mini_i2v_dataset \
    --repo_type=dataset
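
Because this dataset is described as containing both raw videos and preprocessed latents, it can be useful to check what actually landed on disk before preprocessing. The sketch below counts the downloaded files by extension. The function names are hypothetical, and no particular file format or directory layout is assumed.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable


def summarize_suffixes(paths: Iterable[str]) -> Counter:
    """Count paths by lower-cased file extension; files without one map to ''."""
    return Counter(Path(p).suffix.lower() for p in paths)


def scan_dataset(root: str) -> Counter:
    """Recursively count the regular files under `root`, grouped by extension."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    return summarize_suffixes(str(p) for p in root_path.rglob("*") if p.is_file())
```

For example, `scan_dataset("data/mini_i2v_dataset")` returns a `Counter` keyed by extension. An empty result suggests the download did not complete.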

Preprocessing for Distillation

The preprocessing steps are identical to those used for training. Run the preprocessing script that matches your model:

# For Wan2.1 T2V
bash scripts/preprocess/v1_preprocess_wan_data_t2v

# For Wan2.2 TI2V
bash examples/distill/Wan2.2-TI2V-5B-Diffusers/crush_smol/preprocess_wan_data_ti2v_5b.sh
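
If you drive preprocessing from Python (for example, inside a larger pipeline), the choice of script can be sketched as a small lookup. The script paths are copied verbatim from the commands above. The dictionary keys and function names are hypothetical labels introduced for illustration, and the scripts are assumed to be run from the repository root.

```python
import subprocess
from typing import List

# Paths copied from this page; the keys are hypothetical labels, not official model IDs.
PREPROCESS_SCRIPTS = {
    "wan2.1-t2v": "scripts/preprocess/v1_preprocess_wan_data_t2v",
    "wan2.2-ti2v": (
        "examples/distill/Wan2.2-TI2V-5B-Diffusers/crush_smol/"
        "preprocess_wan_data_ti2v_5b.sh"
    ),
}


def preprocess_command(model: str) -> List[str]:
    """Return the bash command line for the given model label."""
    if model not in PREPROCESS_SCRIPTS:
        known = ", ".join(sorted(PREPROCESS_SCRIPTS))
        raise ValueError(f"Unknown model {model!r}; expected one of: {known}")
    return ["bash", PREPROCESS_SCRIPTS[model]]


def run_preprocess(model: str) -> None:
    """Run the preprocessing script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(preprocess_command(model), check=True)
```

Keeping `preprocess_command` separate from `run_preprocess` lets you inspect or log the command before anything is executed.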