fastvideo.v1.layers.visual_embedding#

Module Contents#

Classes#

ModulateProjection

Modulation layer for DiT blocks.

PatchEmbed

2D Image to Patch Embedding

TimestepEmbedder

Embeds scalar timesteps into vector representations.

Functions#

timestep_embedding

Create sinusoidal timestep embeddings.

unpatchify

Convert patched representation back to image space.

API#

class fastvideo.v1.layers.visual_embedding.ModulateProjection(hidden_size: int, factor: int = 2, act_layer: str = 'silu', dtype: Optional[torch.dtype] = None, prefix: str = '')[source]#

Bases: torch.nn.Module

Modulation layer for DiT blocks.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor) → torch.Tensor[source]#
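
A minimal usage sketch (not from the source; the output shape and the shift/scale chunking below are assumptions based on adaLN-style modulation in DiT blocks):

```python
import torch

from fastvideo.v1.layers.visual_embedding import ModulateProjection

hidden_size = 768
mod = ModulateProjection(hidden_size, factor=2, act_layer="silu")

cond = torch.randn(4, hidden_size)        # [B, hidden_size] conditioning vector
mod_out = mod(cond)                       # assumed shape: [B, factor * hidden_size]
shift, scale = mod_out.chunk(2, dim=-1)   # split into per-block modulation terms

x = torch.randn(4, 128, hidden_size)      # [B, L, hidden_size] token sequence
x_mod = x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```
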
class fastvideo.v1.layers.visual_embedding.PatchEmbed(patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, bias=True, dtype=None, prefix: str = '')[source]#

Bases: torch.nn.Module

2D Image to Patch Embedding

Image to Patch Embedding using Conv2d

A convolution-based approach to patchifying a 2D image with an embedding projection.

Based on the impl in https://github.com/google-research/vision_transformer

Hacked together by / Copyright 2020 Ross Wightman

The _assert call in the forward function has been removed for compatibility with multi-resolution images.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]#
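
A minimal usage sketch (assuming the default flatten=True, which yields a [B, num_patches, embed_dim] token sequence):

```python
import torch

from fastvideo.v1.layers.visual_embedding import PatchEmbed

patch_embed = PatchEmbed(patch_size=16, in_chans=3, embed_dim=768)

imgs = torch.randn(2, 3, 224, 224)   # [B, C, H, W]
tokens = patch_embed(imgs)           # expected: [2, (224 // 16) ** 2, 768] = [2, 196, 768]
```
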
class fastvideo.v1.layers.visual_embedding.TimestepEmbedder(hidden_size, act_layer='silu', frequency_embedding_size=256, max_period=10000, dtype=None, freq_dtype=torch.float32, prefix: str = '')[source]#

Bases: torch.nn.Module

Embeds scalar timesteps into vector representations.

Initialization

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(t: torch.Tensor) → torch.Tensor[source]#
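
A minimal usage sketch (the hidden_size value is illustrative; the output shape [B, hidden_size] follows from the class description):

```python
import torch

from fastvideo.v1.layers.visual_embedding import TimestepEmbedder

embedder = TimestepEmbedder(hidden_size=1152)

t = torch.randint(0, 1000, (8,))   # [B] scalar diffusion timesteps
t_emb = embedder(t)                # expected shape: [8, 1152]
```
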
fastvideo.v1.layers.visual_embedding.timestep_embedding(t: torch.Tensor, dim: int, max_period: int = 10000, dtype: torch.dtype = torch.float32) → torch.Tensor[source]#

Create sinusoidal timestep embeddings.

Parameters:
  • t – Tensor of shape [B] with timesteps

  • dim – Embedding dimension

  • max_period – Controls the minimum frequency of the embeddings

Returns:

Tensor of shape [B, dim] with embeddings
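
A minimal usage sketch of the standalone function:

```python
import torch

from fastvideo.v1.layers.visual_embedding import timestep_embedding

t = torch.tensor([0.0, 250.0, 999.0])   # [B] timesteps
emb = timestep_embedding(t, dim=256)    # sinusoidal embedding
print(emb.shape, emb.dtype)             # torch.Size([3, 256]) torch.float32
```
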

fastvideo.v1.layers.visual_embedding.unpatchify(x, t, h, w, patch_size, channels) → torch.Tensor[source]#

Convert patched representation back to image space.

Parameters:
  • x – Tensor of shape [B, T*H*W, C*P_t*P_h*P_w]

  • t – Number of temporal patches T

  • h – Number of patches along the height H

  • w – Number of patches along the width W

  • patch_size – Patch size per axis (P_t, P_h, P_w)

  • channels – Number of channels C

Returns:

Unpatchified tensor of shape [B, C, T*P_t, H*P_h, W*P_w]
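
A minimal round-trip shape sketch (the patch_size value and its interpretation as a (P_t, P_h, P_w) tuple are assumptions for illustration):

```python
import torch

from fastvideo.v1.layers.visual_embedding import unpatchify

B, C = 2, 4                # batch size and channels
t, h, w = 4, 8, 8          # temporal / height / width patch counts
p_t, p_h, p_w = 1, 2, 2    # assumed patch size per axis

x = torch.randn(B, t * h * w, C * p_t * p_h * p_w)   # [B, T*H*W, C*P_t*P_h*P_w]
video = unpatchify(x, t, h, w, (p_t, p_h, p_w), C)
print(video.shape)         # expected: [2, 4, 4, 16, 16] = [B, C, T*P_t, H*P_h, W*P_w]
```
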