fastvideo.v1.training.training_utils#
Module Contents#
Functions#
| Function | Description |
| --- | --- |
| clip_grad_norm_ | Clip the gradient norm of parameters. |
| compute_density_for_timestep_sampling | Compute the density for sampling the timesteps when doing SD3 training. |
| custom_to_hf_state_dict | Convert fastvideo's custom model format to diffusers format using reverse_param_names_mapping. |
| load_checkpoint | Load checkpoint following finetrainer's distributed checkpoint approach. Returns the step number from which training should resume. |
| save_checkpoint | Save checkpoint following finetrainer's distributed checkpoint approach. Saves both distributed checkpoint and consolidated model weights. |
Data#
API#
- fastvideo.v1.training.training_utils.clip_grad_norm_(parameters: torch.Tensor | list[torch.Tensor], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool | None = None, pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None) torch.Tensor [source]#
Clip the gradient norm of parameters.
Gradient norm clipping requires computing the gradient norm over the entire model. torch.nn.utils.clip_grad_norm_ only computes the gradient norm along the DP/FSDP/TP dimensions, so the gradient norm must be manually reduced across PP stages. See https://github.com/pytorch/torchtitan/issues/596 for details. A usage sketch follows this entry.
- Parameters:
  - parameters (torch.Tensor or List[torch.Tensor]) – Tensors that will have gradients normalized.
  - max_norm (float) – Maximum norm of the gradients after clipping.
  - norm_type (float, defaults to 2.0) – Type of p-norm to use. Can be inf for infinity norm.
  - error_if_nonfinite (bool, defaults to False) – If True, an error is thrown if the total norm of the gradients from parameters is nan, inf, or -inf.
  - foreach (bool, defaults to None) – Use the faster foreach-based implementation. If None, use the foreach implementation for CUDA and CPU native tensors and silently fall back to the slow implementation for other device types.
  - pp_mesh (torch.distributed.device_mesh.DeviceMesh, defaults to None) – Pipeline parallel device mesh. If not None, will reduce gradient norm across PP stages.
- Returns:
Total norm of the gradients
- Return type:
torch.Tensor
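A minimal usage sketch, assuming the package is importable and gradients have already been produced by a backward pass; the tiny linear model and the max_norm value are illustrative only, not part of the API:

```python
import torch

from fastvideo.v1.training.training_utils import clip_grad_norm_

# Tiny stand-in model; in real training this would be the (FSDP-wrapped) transformer.
model = torch.nn.Linear(8, 8)
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Clip to an L2 norm of 1.0. pp_mesh is None here because this sketch uses no
# pipeline parallelism; with PP enabled, pass the pipeline device mesh so the
# norm is reduced across stages before clipping.
grad_norm = clip_grad_norm_(list(model.parameters()), max_norm=1.0,
                            norm_type=2.0, pp_mesh=None)
print(f"total gradient norm reported by clip_grad_norm_: {grad_norm.item():.4f}")
```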
- fastvideo.v1.training.training_utils.clip_grad_norm_while_handling_failing_dtensor_cases(parameters: torch.Tensor | list[torch.Tensor], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool | None = None, pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None) torch.Tensor | None [source]#
- fastvideo.v1.training.training_utils.compute_density_for_timestep_sampling(weighting_scheme: str, batch_size: int, generator, logit_mean: float | None = None, logit_std: float | None = None, mode_scale: float | None = None)[source]#
Compute the density for sampling the timesteps when doing SD3 training.
Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.
SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
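A hedged sampling sketch. It assumes the behavior of the diffusers helper this mirrors, i.e. that the function returns a tensor of batch_size values in [0, 1) which the caller maps onto discrete scheduler timesteps; the weighting-scheme name, scheduler length, and final mapping line are assumptions for illustration.

```python
import torch

from fastvideo.v1.training.training_utils import compute_density_for_timestep_sampling

num_train_timesteps = 1000                     # illustrative scheduler length
generator = torch.Generator().manual_seed(42)  # for reproducible sampling

# Sample per-example densities u in [0, 1) with a logit-normal weighting,
# as in SD3-style training (scheme name assumed from the diffusers helper).
u = compute_density_for_timestep_sampling(
    weighting_scheme="logit_normal",
    batch_size=4,
    generator=generator,
    logit_mean=0.0,
    logit_std=1.0,
)

# Map densities to discrete timestep indices (assumed caller-side convention).
timestep_indices = (u * num_train_timesteps).long()
```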
- fastvideo.v1.training.training_utils.custom_to_hf_state_dict(state_dict: dict[str, Any] | collections.abc.Iterator[tuple[str, torch.Tensor]], reverse_param_names_mapping: dict[str, tuple[str, int, int]]) dict[str, Any] [source]#
Convert fastvideo's custom model format to diffusers format using reverse_param_names_mapping.
- Parameters:
  - state_dict – State dict in fastvideo's custom format.
  - reverse_param_names_mapping – Reverse mapping from fastvideo's custom format to diffusers format.
- Returns:
State dict in diffusers format
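A hedged export sketch showing where this conversion could sit when saving weights in diffusers format. The transformer variable, its reverse_param_names_mapping attribute, and the output path are assumptions rather than documented API; where the reverse mapping actually comes from depends on the model class.

```python
from safetensors.torch import save_file

from fastvideo.v1.training.training_utils import custom_to_hf_state_dict

# Assumed: `transformer` is a fastvideo model that exposes a reverse mapping from
# its custom parameter names back to diffusers names (attribute name hypothetical).
reverse_mapping = transformer.reverse_param_names_mapping

# Convert the in-memory state dict to diffusers naming before saving.
hf_state_dict = custom_to_hf_state_dict(transformer.state_dict(), reverse_mapping)
save_file(hf_state_dict, "transformer/diffusion_pytorch_model.safetensors")  # illustrative path
```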
- fastvideo.v1.training.training_utils.gather_state_dict_on_cpu_rank0(model, device: torch.device | None = None) dict[str, Any] [source]#
- fastvideo.v1.training.training_utils.get_sigmas(noise_scheduler, device, timesteps, n_dim=4, dtype=torch.float32) torch.Tensor [source]#
- fastvideo.v1.training.training_utils.load_checkpoint(transformer, rank, checkpoint_path, optimizer=None, dataloader=None, scheduler=None, noise_generator=None) int [source]#
Load checkpoint following finetrainer's distributed checkpoint approach. Returns the step number from which training should resume.
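A hedged resume sketch, assuming transformer, optimizer, train_dataloader, lr_scheduler, and resume_dir were constructed earlier in the training script and that a checkpoint was previously written with save_checkpoint:

```python
import torch.distributed as dist

from fastvideo.v1.training.training_utils import load_checkpoint

rank = dist.get_rank() if dist.is_initialized() else 0

# Restore model, optimizer, dataloader, and LR-scheduler state in place and
# obtain the step to resume from.
init_steps = load_checkpoint(
    transformer,
    rank,
    resume_dir,
    optimizer=optimizer,
    dataloader=train_dataloader,
    scheduler=lr_scheduler,
)
# The training loop then starts counting from init_steps instead of 0.
```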
- fastvideo.v1.training.training_utils.normalize_dit_input(model_type, latents, args=None) torch.Tensor [source]#
- fastvideo.v1.training.training_utils.save_checkpoint(transformer, rank, output_dir, step, optimizer=None, dataloader=None, scheduler=None, noise_generator=None) None [source]#
Save checkpoint following finetrainer's distributed checkpoint approach. Saves both distributed checkpoint and consolidated model weights.
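A hedged checkpointing sketch for the training loop; the save interval, output directory, and surrounding variables (step, rank, transformer, optimizer, train_dataloader, lr_scheduler) are illustrative and assumed to exist in the caller's script:

```python
from fastvideo.v1.training.training_utils import save_checkpoint

checkpoint_every = 500  # illustrative save interval

if step % checkpoint_every == 0:
    # Writes both the distributed checkpoint and the consolidated model
    # weights under the given output directory for this step.
    save_checkpoint(
        transformer,
        rank,
        "outputs/checkpoints",  # output_dir (illustrative)
        step,
        optimizer=optimizer,
        dataloader=train_dataloader,
        scheduler=lr_scheduler,
    )
```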
- fastvideo.v1.training.training_utils.shard_latents_across_sp(latents: torch.Tensor, num_latent_t: int) torch.Tensor [source]#