fastvideo.v1.layers.visual_embedding#

Module Contents#

Classes#

ModulateProjection

Modulation layer for DiT blocks.

PatchEmbed

2D Image to Patch Embedding

TimestepEmbedder

Embeds scalar timesteps into vector representations.

Functions#

timestep_embedding

Create sinusoidal timestep embeddings.

unpatchify

Convert patched representation back to image space.

API#

class fastvideo.v1.layers.visual_embedding.ModulateProjection(hidden_size: int, factor: int = 2, act_layer: str = 'silu', dtype: Optional[torch.dtype] = None, prefix: str = '')[source]#

Bases: torch.nn.Module

Modulation layer for DiT blocks.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor) → torch.Tensor[source]#
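
A minimal usage sketch (not from the source; the output shape and the shift/scale chunking below are assumptions based on adaLN-style modulation in DiT blocks):

```python
import torch

from fastvideo.v1.layers.visual_embedding import ModulateProjection

hidden_size = 768
mod = ModulateProjection(hidden_size, factor=2, act_layer="silu")

cond = torch.randn(4, hidden_size)        # [B, hidden_size] conditioning vector
mod_out = mod(cond)                       # assumed shape: [B, factor * hidden_size]
shift, scale = mod_out.chunk(2, dim=-1)   # split into per-block modulation terms

x = torch.randn(4, 128, hidden_size)      # [B, L, hidden_size] token sequence
x_mod = x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```
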
class fastvideo.v1.layers.visual_embedding.PatchEmbed(patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, bias=True, dtype=None, prefix: str = '')[source]#

Bases: torch.nn.Module

2D Image to Patch Embedding

Image to Patch Embedding using Conv2d

A convolution-based approach to patchifying a 2D image with an embedding projection.

Based on the impl in https://github.com/google-research/vision_transformer

Hacked together by / Copyright 2020 Ross Wightman

The _assert call in the forward function has been removed for compatibility with multi-resolution images.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]#
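
A minimal usage sketch (assuming the default flatten=True, which yields a [B, num_patches, embed_dim] token sequence):

```python
import torch

from fastvideo.v1.layers.visual_embedding import PatchEmbed

patch_embed = PatchEmbed(patch_size=16, in_chans=3, embed_dim=768)

imgs = torch.randn(2, 3, 224, 224)   # [B, C, H, W]
tokens = patch_embed(imgs)           # expected: [2, (224 // 16) ** 2, 768] = [2, 196, 768]
```
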
class fastvideo.v1.layers.visual_embedding.TimestepEmbedder(hidden_size, act_layer='silu', frequency_embedding_size=256, max_period=10000, dtype=None, freq_dtype=torch.float32, prefix: str = '')[source]#

Bases: torch.nn.Module

Embeds scalar timesteps into vector representations.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(t: torch.Tensor) → torch.Tensor[source]#
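
A minimal usage sketch (the hidden_size value is illustrative; the output shape [B, hidden_size] follows from the class description):

```python
import torch

from fastvideo.v1.layers.visual_embedding import TimestepEmbedder

embedder = TimestepEmbedder(hidden_size=1152)

t = torch.randint(0, 1000, (8,))   # [B] scalar diffusion timesteps
t_emb = embedder(t)                # expected shape: [8, 1152]
```
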
fastvideo.v1.layers.visual_embedding.timestep_embedding(t: torch.Tensor, dim: int, max_period: int = 10000, dtype: torch.dtype = torch.float32) → torch.Tensor[source]#

Create sinusoidal timestep embeddings.

Parameters:
  • t – Tensor of shape [B] with timesteps

  • dim – Embedding dimension

  • max_period – Controls the minimum frequency of the embeddings

Returns:

Tensor of shape [B, dim] with embeddings
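
A minimal usage sketch of the standalone function:

```python
import torch

from fastvideo.v1.layers.visual_embedding import timestep_embedding

t = torch.tensor([0.0, 250.0, 999.0])   # [B] timesteps
emb = timestep_embedding(t, dim=256)    # sinusoidal embedding
print(emb.shape, emb.dtype)             # torch.Size([3, 256]) torch.float32
```
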

fastvideo.v1.layers.visual_embedding.unpatchify(x, t, h, w, patch_size, channels) → torch.Tensor[source]#

Convert patched representation back to image space.

Parameters:
  • x – Tensor of shape [B, T*H*W, C*P_t*P_h*P_w]

  • t – Number of temporal patches T

  • h – Number of patches along the height H

  • w – Number of patches along the width W

  • patch_size – Patch size per axis (P_t, P_h, P_w)

  • channels – Number of channels C

Returns:

Unpatchified tensor of shape [B, C, T*P_t, H*P_h, W*P_w]
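
A minimal round-trip shape sketch (the patch_size value and its interpretation as a (P_t, P_h, P_w) tuple are assumptions for illustration):

```python
import torch

from fastvideo.v1.layers.visual_embedding import unpatchify

B, C = 2, 4                # batch size and channels
t, h, w = 4, 8, 8          # temporal / height / width patch counts
p_t, p_h, p_w = 1, 2, 2    # assumed patch size per axis

x = torch.randn(B, t * h * w, C * p_t * p_h * p_w)   # [B, T*H*W, C*P_t*P_h*P_w]
video = unpatchify(x, t, h, w, (p_t, p_h, p_w), C)
print(video.shape)         # expected: [2, 4, 4, 16, 16] = [B, C, T*P_t, H*P_h, W*P_w]
```
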