fastvideo.training.training_utils

`fastvideo.training.training_utils`#

Module Contents#

Classes#

`EMA_FSDP`	FSDP2-friendly EMA with two modes:
`SchedulerType`

Functions#

`clip_grad_norm_`	Clip the gradient norm of parameters.
`clip_grad_norm_while_handling_failing_dtensor_cases`
`compute_density_for_timestep_sampling`	Compute the density for sampling the timesteps when doing SD3 training.
`count_trainable`
`custom_to_hf_state_dict`	Convert fastvideo’s custom model format to diffusers format using reverse_param_names_mapping.
`gather_state_dict_on_cpu_rank0`
`get_constant_schedule`	Create a schedule with a constant learning rate, using the learning rate set in optimizer.
`get_constant_schedule_with_warmup`	Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer.
`get_cosine_schedule_with_min_lr`	Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to a minimum lr (min_lr_ratio * initial_lr), after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.
`get_cosine_schedule_with_warmup`	Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.
`get_cosine_with_hard_restarts_schedule_with_warmup`	Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.
`get_linear_schedule_with_warmup`	Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
`get_piecewise_constant_schedule`	Create a schedule with a constant learning rate, using the learning rate set in optimizer.
`get_polynomial_decay_schedule_with_warmup`	Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the optimizer to end lr defined by lr_end, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
`get_scheduler`	Unified API to get any scheduler from its name.
`get_sigmas`
`load_checkpoint`	Load checkpoint following finetrainer’s distributed checkpoint approach. Returns the step number from which training should resume.
`load_distillation_checkpoint`	Load distillation checkpoint with both generator and fake_score models. Supports MoE (Mixture of Experts) models with transformer_2 variants. Returns the step number from which training should resume.
`normalize_dit_input`
`save_checkpoint`	Save checkpoint following finetrainer’s distributed checkpoint approach. Saves both distributed checkpoint and consolidated model weights.
`save_distillation_checkpoint`	Save distillation checkpoint with both generator and fake_score models. Supports MoE (Mixture of Experts) models with transformer_2 variants. Saves both distributed checkpoint and consolidated model weights. Only saves the generator model for inference (consolidated weights).
`shard_latents_across_sp`
`shift_timestep`

Data#

`SchedulerFunction`
`TYPE_TO_SCHEDULER_FUNCTION`
`logger`

API#

class fastvideo.training.training_utils.EMA_FSDP(module, decay: float = 0.999, mode: str = 'local_shard')[source]#

FSDP2-friendly EMA with two modes:

mode=”local_shard” (default): maintain float32 CPU EMA of local parameter shards on every rank. Provides a context manager to temporarily swap EMA weights into the live model for teacher forward.
mode=”rank0_full”: maintain a consolidated float32 CPU EMA of full parameters on rank 0 only using gather_state_dict_on_cpu_rank0(). Useful for checkpoint export; not for teacher forward.

Usage (local_shard for CM teacher): ema = EMA_FSDP(model, decay=0.999, mode=”local_shard”) for step in …: ema.update(model) with ema.apply_to_model(model): with torch.no_grad(): y_teacher = model(…)

Usage (rank0_full for export): ema = EMA_FSDP(model, decay=0.999, mode=”rank0_full”) ema.update(model) ema.state_dict() # on rank 0

Initialization

apply_to_model(module)[source]#

copy_to_unwrapped(module) → None[source]#: Copy EMA weights into a non-sharded (unwrapped) module. Intended for export/eval. For mode=”rank0_full”, only rank 0 has the full EMA state.

load_state_dict(sd: dict[str, torch.Tensor])[source]#

state_dict() → dict[str, torch.Tensor][source]#

update(module)[source]#

fastvideo.training.training_utils.SchedulerFunction[source]#: None

class fastvideo.training.training_utils.SchedulerType[source]#

Bases: enum.Enum

CONSTANT[source]#: ‘constant’

CONSTANT_WITH_WARMUP[source]#: ‘constant_with_warmup’

COSINE[source]#: ‘cosine’

COSINE_WITH_MIN_LR[source]#: ‘cosine_with_min_lr’

COSINE_WITH_RESTARTS[source]#: ‘cosine_with_restarts’

LINEAR[source]#: ‘linear’

PIECEWISE_CONSTANT[source]#: ‘piecewise_constant’

POLYNOMIAL[source]#: ‘polynomial’

fastvideo.training.training_utils.TYPE_TO_SCHEDULER_FUNCTION: dict[fastvideo.training.training_utils.SchedulerType, fastvideo.training.training_utils.SchedulerFunction][source]#: None

fastvideo.training.training_utils.clip_grad_norm_(parameters: torch.Tensor | list[torch.Tensor], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool | None = None, pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None) → torch.Tensor[source]#

Clip the gradient norm of parameters.

Gradient norm clipping requires computing the gradient norm over the entire model. torch.nn.utils.clip_grad_norm_ only computes gradient norm along DP/FSDP/TP dimensions. We need to manually reduce the gradient norm across PP stages. See https://github.com/pytorch/torchtitan/issues/596 for details.

Parameters:

parameters (torch.Tensor or List[torch.Tensor]) – Tensors that will have gradients normalized.
max_norm (float) – Maximum norm of the gradients after clipping.
norm_type (float, defaults to 2.0) – Type of p-norm to use. Can be inf for infinity norm.
error_if_nonfinite (bool, defaults to False) – If True, an error is thrown if the total norm of the gradients from parameters is nan, inf, or -inf.
foreach (bool, defaults to None) – Use the faster foreach-based implementation. If None, use the foreach implementation for CUDA and CPU native tensors and silently fall back to the slow implementation for other device types.
pp_mesh (torch.distributed.device_mesh.DeviceMesh, defaults to None) – Pipeline parallel device mesh. If not None, will reduce gradient norm across PP stages.

Returns:

Total norm of the gradients

Return type:

torch.Tensor

fastvideo.training.training_utils.clip_grad_norm_while_handling_failing_dtensor_cases(parameters: torch.Tensor | list[torch.Tensor], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool | None = None, pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None) → torch.Tensor | None[source]#

fastvideo.training.training_utils.compute_density_for_timestep_sampling(weighting_scheme: str, batch_size: int, generator, logit_mean: float | None = None, logit_std: float | None = None, mode_scale: float | None = None)[source]#

Compute the density for sampling the timesteps when doing SD3 training.

Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.

SD3 paper reference: https://arxiv.org/abs/2403.03206v1.

fastvideo.training.training_utils.count_trainable(model: torch.nn.Module) → int[source]#

fastvideo.training.training_utils.custom_to_hf_state_dict(state_dict: dict[str, Any] | collections.abc.Iterator[tuple[str, torch.Tensor]], reverse_param_names_mapping: dict[str, tuple[str, int, int]]) → dict[str, Any][source]#

Convert fastvideo’s custom model format to diffusers format using reverse_param_names_mapping.

Parameters:

state_dict – State dict in fastvideo’s custom format
reverse_param_names_mapping – Reverse mapping from fastvideo’s custom format to diffusers format

Returns:

State dict in diffusers format

fastvideo.training.training_utils.gather_state_dict_on_cpu_rank0(model, device: torch.device | None = None) → dict[str, Any][source]#

fastvideo.training.training_utils.get_constant_schedule(optimizer: torch.optim.Optimizer, last_epoch: int = -1) → torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a constant learning rate, using the learning rate set in optimizer.

Parameters:

optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.
last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns: