fastvideo.training.training_utils#

Module Contents#

Classes#

Functions#

clip_grad_norm_

Clip the gradient norm of parameters.

clip_grad_norm_while_handling_failing_dtensor_cases

compute_density_for_timestep_sampling

Compute the density for sampling the timesteps when doing SD3 training.

custom_to_hf_state_dict

Convert fastvideo’s custom model format to diffusers format using reverse_param_names_mapping.

gather_state_dict_on_cpu_rank0

get_constant_schedule

Create a schedule with a constant learning rate, using the learning rate set in optimizer.

get_constant_schedule_with_warmup

Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer.

get_cosine_schedule_with_min_lr

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to a minimum lr (min_lr_ratio * initial_lr), after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

get_cosine_schedule_with_warmup

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

get_cosine_with_hard_restarts_schedule_with_warmup

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

get_linear_schedule_with_warmup

Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

get_piecewise_constant_schedule

Create a schedule with a piecewise constant learning rate, scaling the learning rate set in the optimizer according to step_rules.

get_polynomial_decay_schedule_with_warmup

Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the optimizer to end lr defined by lr_end, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

get_scheduler

Unified API to get any scheduler from its name.

get_sigmas

load_checkpoint

Load checkpoint following finetrainer’s distributed checkpoint approach. Returns the step number from which training should resume.

load_distillation_checkpoint

Load distillation checkpoint with both generator and fake_score models. Returns the step number from which training should resume.

normalize_dit_input

pred_noise_to_pred_video

Convert predicted noise to clean latent.

save_checkpoint

Save checkpoint following finetrainer’s distributed checkpoint approach. Saves both distributed checkpoint and consolidated model weights.

save_distillation_checkpoint

Save distillation checkpoint with both generator and fake_score models. Saves both the distributed checkpoint and consolidated model weights; only the generator’s consolidated weights are saved for inference.

shard_latents_across_sp

shift_timestep

Data#

API#

fastvideo.training.training_utils.SchedulerFunction[source]#

None

class fastvideo.training.training_utils.SchedulerType[source]#

Bases: enum.Enum

CONSTANT[source]#

‘constant’

CONSTANT_WITH_WARMUP[source]#

‘constant_with_warmup’

COSINE[source]#

‘cosine’

COSINE_WITH_MIN_LR[source]#

‘cosine_with_min_lr’

COSINE_WITH_RESTARTS[source]#

‘cosine_with_restarts’

LINEAR[source]#

‘linear’

PIECEWISE_CONSTANT[source]#

‘piecewise_constant’

POLYNOMIAL[source]#

‘polynomial’
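
The enum members wrap the string names used by get_scheduler, so either form can be passed. A quick, self-contained check:

```python
from fastvideo.training.training_utils import SchedulerType

# The enum can be constructed from its string value, so "cosine" and
# SchedulerType.COSINE name the same schedule.
assert SchedulerType("cosine") is SchedulerType.COSINE
assert SchedulerType.COSINE.value == "cosine"
```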

fastvideo.training.training_utils.TYPE_TO_SCHEDULER_FUNCTION: dict[fastvideo.training.training_utils.SchedulerType, fastvideo.training.training_utils.SchedulerFunction][source]#

None

fastvideo.training.training_utils.clip_grad_norm_(parameters: torch.Tensor | list[torch.Tensor], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool | None = None, pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None) torch.Tensor[source]#

Clip the gradient norm of parameters.

Gradient norm clipping requires computing the gradient norm over the entire model. torch.nn.utils.clip_grad_norm_ only computes gradient norm along DP/FSDP/TP dimensions. We need to manually reduce the gradient norm across PP stages. See https://github.com/pytorch/torchtitan/issues/596 for details.

Parameters:
  • parameters (torch.Tensor or List[torch.Tensor]) – Tensors that will have gradients normalized.

  • max_norm (float) – Maximum norm of the gradients after clipping.

  • norm_type (float, defaults to 2.0) – Type of p-norm to use. Can be inf for infinity norm.

  • error_if_nonfinite (bool, defaults to False) – If True, an error is thrown if the total norm of the gradients from parameters is nan, inf, or -inf.

  • foreach (bool, defaults to None) – Use the faster foreach-based implementation. If None, use the foreach implementation for CUDA and CPU native tensors and silently fall back to the slow implementation for other device types.

  • pp_mesh (torch.distributed.device_mesh.DeviceMesh, defaults to None) – Pipeline parallel device mesh. If not None, will reduce gradient norm across PP stages.

Returns:

Total norm of the gradients

Return type:

torch.Tensor
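
A minimal single-process sketch (no pipeline parallelism, so pp_mesh is omitted); the toy linear layer and dummy backward pass exist only to give the parameters gradients to clip:

```python
import torch

from fastvideo.training.training_utils import clip_grad_norm_

# Toy model and a dummy backward pass so the parameters carry gradients.
model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()

# Clip to a maximum L2 norm of 1.0. When pipeline parallelism is enabled,
# pass pp_mesh as well so the norm is reduced across PP stages.
total_norm = clip_grad_norm_(list(model.parameters()), max_norm=1.0, norm_type=2.0)
print(f"pre-clip gradient norm: {total_norm.item():.4f}")
```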

fastvideo.training.training_utils.clip_grad_norm_while_handling_failing_dtensor_cases(parameters: torch.Tensor | list[torch.Tensor], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool | None = None, pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None) torch.Tensor | None[source]#
fastvideo.training.training_utils.compute_density_for_timestep_sampling(weighting_scheme: str, batch_size: int, generator, logit_mean: float | None = None, logit_std: float | None = None, mode_scale: float | None = None)[source]#

Compute the density for sampling the timesteps when doing SD3 training.

Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.

SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
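
A sketch of typical usage; the "logit_normal" scheme name and the mapping of the sampled values onto scheduler timesteps follow the diffusers helper this function mirrors and are assumptions here:

```python
import torch

from fastvideo.training.training_utils import compute_density_for_timestep_sampling

generator = torch.Generator().manual_seed(0)

# Per-sample values u in [0, 1); "logit_normal" is the scheme used in the SD3
# paper (scheme name assumed to match the diffusers helper).
u = compute_density_for_timestep_sampling(
    weighting_scheme="logit_normal",
    batch_size=4,
    generator=generator,
    logit_mean=0.0,
    logit_std=1.0,
)

# u is then typically mapped onto the scheduler's discrete timestep grid, e.g.
# indices = (u * noise_scheduler.config.num_train_timesteps).long()
```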

fastvideo.training.training_utils.custom_to_hf_state_dict(state_dict: dict[str, Any] | collections.abc.Iterator[tuple[str, torch.Tensor]], reverse_param_names_mapping: dict[str, tuple[str, int, int]]) dict[str, Any][source]#

Convert fastvideo’s custom model format to diffusers format using reverse_param_names_mapping.

Parameters:
  • state_dict – State dict in fastvideo’s custom format

  • reverse_param_names_mapping – Reverse mapping from fastvideo’s custom format to diffusers format

Returns:

State dict in diffusers format
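
A hedged export sketch: `transformer` stands in for a loaded FastVideo transformer, and the `reverse_param_names_mapping` attribute on the model is an assumption (the docstring only fixes the argument, not where the mapping is stored):

```python
from safetensors.torch import save_file

from fastvideo.training.training_utils import custom_to_hf_state_dict

# `transformer` is a placeholder for a FastVideo transformer instance;
# `transformer.reverse_param_names_mapping` is an assumed attribute holding the
# dict[str, tuple[str, int, int]] reverse mapping described above.
hf_state_dict = custom_to_hf_state_dict(
    transformer.state_dict(),
    transformer.reverse_param_names_mapping,
)

# The converted weights can then be saved with standard safetensors tooling.
save_file(hf_state_dict, "diffusion_pytorch_model.safetensors")
```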

fastvideo.training.training_utils.gather_state_dict_on_cpu_rank0(model, device: torch.device | None = None) dict[str, Any][source]#
fastvideo.training.training_utils.get_constant_schedule(optimizer: torch.optim.Optimizer, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a constant learning rate, using the learning rate set in optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

fastvideo.training.training_utils.get_constant_schedule_with_warmup(optimizer: torch.optim.Optimizer, num_warmup_steps: int, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

fastvideo.training.training_utils.get_cosine_schedule_with_min_lr(optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int, min_lr_ratio: float = 0.1, num_cycles: float = 0.5, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to a minimum lr (min_lr_ratio * initial_lr), after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • min_lr_ratio (float, optional, defaults to 0.1) – The ratio of minimum learning rate to initial learning rate.

  • num_cycles (float, optional, defaults to 0.5) – The number of periods of the cosine function in a schedule.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

fastvideo.training.training_utils.get_cosine_schedule_with_warmup(optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • num_cycles (float, optional, defaults to 0.5) – The number of periods of the cosine function in a schedule (the default is to just decrease from the max value to 0 following a half-cosine).

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
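
A minimal sketch with a toy optimizer: the learning rate warms up linearly over the first 100 steps, then follows a half-cosine down to 0 over the remaining 900:

```python
import torch

from fastvideo.training.training_utils import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=1000,
)

for step in range(1000):
    optimizer.step()     # forward/backward omitted in this sketch
    lr_scheduler.step()  # advance the schedule once per optimizer step
```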

fastvideo.training.training_utils.get_cosine_with_hard_restarts_schedule_with_warmup(optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • num_cycles (int, optional, defaults to 1) – The number of hard restarts to use.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

fastvideo.training.training_utils.get_linear_schedule_with_warmup(optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

fastvideo.training.training_utils.get_piecewise_constant_schedule(optimizer: torch.optim.Optimizer, step_rules: str, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a piecewise constant learning rate, scaling the learning rate set in the optimizer according to step_rules.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • step_rules (string) – The rules for the learning rate, e.g. step_rules="1:10,0.1:20,0.01:30,0.005" means the learning rate is multiplied by 1 for the first 10 steps, by 0.1 for the next 20 steps, by 0.01 for the next 30 steps, and by 0.005 for all remaining steps.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
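
A short sketch of the step_rules format described above:

```python
import torch

from fastvideo.training.training_utils import get_piecewise_constant_schedule

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Multiplier 1.0 for the first 10 steps, 0.1 for the next 20, 0.01 for the
# next 30, and 0.005 for every step after that.
lr_scheduler = get_piecewise_constant_schedule(
    optimizer,
    step_rules="1:10,0.1:20,0.01:30,0.005",
)
```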

fastvideo.training.training_utils.get_polynomial_decay_schedule_with_warmup(optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int, lr_end: float = 1e-07, power: float = 1.0, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the optimizer to end lr defined by lr_end, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

Parameters:
  • optimizer ([~torch.optim.Optimizer]) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • lr_end (float, optional, defaults to 1e-7) – The end LR.

  • power (float, optional, defaults to 1.0) – Power factor.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Note: power defaults to 1.0 as in the fairseq implementation, which in turn is based on the original BERT implementation at https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

fastvideo.training.training_utils.get_scheduler(name: str | fastvideo.training.training_utils.SchedulerType, optimizer: torch.optim.Optimizer, step_rules: str | None = None, num_warmup_steps: int | None = None, num_training_steps: int | None = None, num_cycles: int = 1, power: float = 1.0, min_lr_ratio: float = 0.1, last_epoch: int = -1) torch.optim.lr_scheduler.LambdaLR[source]#

Unified API to get any scheduler from its name.

Parameters:
  • name (str or SchedulerType) – The name of the scheduler to use.

  • optimizer (torch.optim.Optimizer) – The optimizer that will be used during training.

  • step_rules (str, optional) – A string representing the step rules to use. This is only used by the PIECEWISE_CONSTANT scheduler.

  • num_warmup_steps (int, optional) – The number of warmup steps to do. This is not required by all schedulers (hence the argument being optional); the function will raise an error if it’s unset and the scheduler type requires it.

  • num_training_steps (int, optional) – The number of training steps to do. This is not required by all schedulers (hence the argument being optional); the function will raise an error if it’s unset and the scheduler type requires it.

  • num_cycles (int, optional) – The number of hard restarts used in COSINE_WITH_RESTARTS scheduler.

  • power (float, optional, defaults to 1.0) – Power factor. See the POLYNOMIAL scheduler.

  • min_lr_ratio (float, optional, defaults to 0.1) – The ratio of minimum learning rate to initial learning rate. Used in COSINE_WITH_MIN_LR scheduler.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.
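
A sketch of the unified entry point; the name may be a string or a SchedulerType member, and the keyword arguments shown are the ones the cosine-with-min-lr schedule uses:

```python
import torch

from fastvideo.training.training_utils import SchedulerType, get_scheduler

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# "cosine_with_min_lr" and SchedulerType.COSINE_WITH_MIN_LR are equivalent names.
lr_scheduler = get_scheduler(
    SchedulerType.COSINE_WITH_MIN_LR,
    optimizer,
    num_warmup_steps=100,
    num_training_steps=1000,
    min_lr_ratio=0.1,  # decay to 10% of the initial lr instead of to 0
)
```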

fastvideo.training.training_utils.get_sigmas(noise_scheduler, device, timesteps, n_dim=4, dtype=torch.float32) torch.Tensor[source]#
fastvideo.training.training_utils.load_checkpoint(transformer, rank, checkpoint_path, optimizer=None, dataloader=None, scheduler=None, noise_generator=None) int[source]#

Load checkpoint following finetrainer’s distributed checkpoint approach. Returns the step number from which training should resume.

fastvideo.training.training_utils.load_distillation_checkpoint(generator_transformer, fake_score_transformer, rank, checkpoint_path, generator_optimizer=None, fake_score_optimizer=None, dataloader=None, generator_scheduler=None, fake_score_scheduler=None, noise_generator=None) int[source]#

Load distillation checkpoint with both generator and fake_score models. Returns the step number from which training should resume.

fastvideo.training.training_utils.logger[source]#

‘init_logger(…)’

fastvideo.training.training_utils.normalize_dit_input(model_type, latents, vae) torch.Tensor[source]#
fastvideo.training.training_utils.pred_noise_to_pred_video(pred_noise: torch.Tensor, noise_input_latent: torch.Tensor, timestep: torch.Tensor, scheduler: Any) torch.Tensor[source]#

Convert predicted noise to clean latent.
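
The exact conversion depends on the scheduler passed in. As an illustration only, under an assumed rectified-flow parameterization x_t = (1 - sigma_t) * x_0 + sigma_t * noise (not necessarily the code path taken here), the clean latent is recovered like this:

```python
import torch

def pred_noise_to_pred_video_sketch(pred_noise: torch.Tensor,
                                    noise_input_latent: torch.Tensor,
                                    sigma_t: torch.Tensor) -> torch.Tensor:
    # Illustrative assumption: the flow-matching interpolation
    #   x_t = (1 - sigma_t) * x_0 + sigma_t * noise
    # implies
    #   x_0 = (x_t - sigma_t * pred_noise) / (1 - sigma_t).
    # The real helper derives sigma_t from `timestep` and `scheduler`.
    return (noise_input_latent - sigma_t * pred_noise) / (1.0 - sigma_t)
```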

fastvideo.training.training_utils.save_checkpoint(transformer, rank, output_dir, step, optimizer=None, dataloader=None, scheduler=None, noise_generator=None) None[source]#

Save checkpoint following finetrainer’s distributed checkpoint approach. Saves both distributed checkpoint and consolidated model weights.
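
A sketch of a save/resume round trip inside a training loop; `transformer`, `optimizer`, `train_dataloader`, `lr_scheduler`, `noise_generator`, `step`, and `checkpoint_path` are placeholders for objects and values produced by the training pipeline:

```python
import torch.distributed as dist

from fastvideo.training.training_utils import load_checkpoint, save_checkpoint

rank = dist.get_rank() if dist.is_initialized() else 0

# Periodically persist training state (placeholder objects, see above).
save_checkpoint(transformer, rank, "outputs/my_run", step,
                optimizer=optimizer, dataloader=train_dataloader,
                scheduler=lr_scheduler, noise_generator=noise_generator)

# On restart, point load_checkpoint at the saved checkpoint directory; it
# returns the step from which training should resume.
resume_step = load_checkpoint(transformer, rank, checkpoint_path,
                              optimizer=optimizer, dataloader=train_dataloader,
                              scheduler=lr_scheduler, noise_generator=noise_generator)
```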

fastvideo.training.training_utils.save_distillation_checkpoint(generator_transformer, fake_score_transformer, rank, output_dir, step, generator_optimizer=None, fake_score_optimizer=None, dataloader=None, generator_scheduler=None, fake_score_scheduler=None, noise_generator=None, only_save_generator_weight=False) None[source]#

Save distillation checkpoint with both generator and fake_score models. Saves both the distributed checkpoint and consolidated model weights; only the generator’s consolidated weights are saved for inference.

Parameters:

only_save_generator_weight – If True, only save the generator model weights for inference without saving distributed checkpoint for training resume.
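
A sketch of the distillation variant; every model, optimizer, rank, and step below is a placeholder from the distillation pipeline. Setting only_save_generator_weight=True skips the distributed training state and exports only the generator’s consolidated weights:

```python
from fastvideo.training.training_utils import save_distillation_checkpoint

# Placeholders: the transformers, optimizers, rank, and step come from the
# distillation training pipeline.
save_distillation_checkpoint(
    generator_transformer,
    fake_score_transformer,
    rank,
    "outputs/distill_run",
    step,
    generator_optimizer=generator_optimizer,
    fake_score_optimizer=fake_score_optimizer,
    only_save_generator_weight=True,  # inference-only export of the generator
)
```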

fastvideo.training.training_utils.shard_latents_across_sp(latents: torch.Tensor, num_latent_t: int) torch.Tensor[source]#
fastvideo.training.training_utils.shift_timestep(timestep: torch.Tensor, shift: float, num_train_timestep: float) torch.Tensor[source]#
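
The signature suggests the usual flow-matching timestep shift; the sketch below shows that common formulation as an assumption about what this helper computes, not a copy of its implementation:

```python
import torch

def shift_timestep_sketch(timestep: torch.Tensor, shift: float,
                          num_train_timestep: float) -> torch.Tensor:
    # Assumed formulation: normalize to [0, 1], apply the flow-matching shift
    # t' = shift * t / (1 + (shift - 1) * t), then map back to timestep units.
    t = timestep / num_train_timestep
    t = shift * t / (1 + (shift - 1) * t)
    return t * num_train_timestep
```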