fastvideo.v1.layers.visual_embedding#
Module Contents#
Classes#
- ModulateProjection – Modulation layer for DiT blocks.
- PatchEmbed – 2D Image to Patch Embedding
- TimestepEmbedder – Embeds scalar timesteps into vector representations.
Functions#
- timestep_embedding – Create sinusoidal timestep embeddings.
- unpatchify – Convert patched representation back to image space.
API#
- class fastvideo.v1.layers.visual_embedding.ModulateProjection(hidden_size: int, factor: int = 2, act_layer: str = 'silu', dtype: Optional[torch.dtype] = None, prefix: str = '')[source]#
Bases: torch.nn.Module
Modulation layer for DiT blocks.
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: torch.Tensor) → torch.Tensor [source]#
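A minimal usage sketch, assuming the layer projects a conditioning vector of size hidden_size into factor * hidden_size modulation parameters (e.g. shift and scale for adaLN-style DiT blocks); the output layout and the shift/scale split are assumptions, not part of the documented API.

```python
import torch
from fastvideo.v1.layers.visual_embedding import ModulateProjection

hidden_size = 768
mod = ModulateProjection(hidden_size, factor=2, act_layer="silu")

cond = torch.randn(4, hidden_size)        # [B, hidden_size] pooled conditioning vector
params = mod(cond)                        # assumed shape: [B, factor * hidden_size]
shift, scale = params.chunk(2, dim=-1)    # assumption: split into shift/scale modulation
```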
- class fastvideo.v1.layers.visual_embedding.PatchEmbed(patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, bias=True, dtype=None, prefix: str = '')[source]#
Bases: torch.nn.Module
2D Image to Patch Embedding
Image to Patch Embedding using Conv2d
A convolution-based approach to patchifying a 2D image with an embedding projection.
Based on the implementation in https://github.com/google-research/vision_transformer
Hacked together by / Copyright 2020 Ross Wightman
The _assert check in the forward function is removed for compatibility with multi-resolution images.
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
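A usage sketch assuming the standard ViT-style convention described above: a [B, C, H, W] image is cut into non-overlapping patch_size x patch_size patches by the Conv2d projection and, with flatten=True, returned as a [B, num_patches, embed_dim] token sequence. The output shape is an assumption based on that convention.

```python
import torch
from fastvideo.v1.layers.visual_embedding import PatchEmbed

patch_embed = PatchEmbed(patch_size=16, in_chans=3, embed_dim=768, flatten=True)

img = torch.randn(2, 3, 224, 224)   # [B, C, H, W]
tokens = patch_embed(img)           # assumed shape: [B, (224 // 16) ** 2, 768] = [2, 196, 768]
```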
- class fastvideo.v1.layers.visual_embedding.TimestepEmbedder(hidden_size, act_layer='silu', frequency_embedding_size=256, max_period=10000, dtype=None, freq_dtype=torch.float32, prefix: str = '')[source]#
Bases: torch.nn.Module
Embeds scalar timesteps into vector representations.
Initialization
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(t: torch.Tensor) → torch.Tensor [source]#
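A usage sketch: a batch of scalar timesteps is mapped to a sinusoidal frequency embedding of size frequency_embedding_size and then projected to a hidden_size vector. The hidden_size value and output shape below are illustrative assumptions.

```python
import torch
from fastvideo.v1.layers.visual_embedding import TimestepEmbedder

embedder = TimestepEmbedder(hidden_size=1152, frequency_embedding_size=256)

t = torch.randint(0, 1000, (8,))    # [B] scalar diffusion timesteps
t_emb = embedder(t)                 # assumed shape: [B, hidden_size]
```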
- fastvideo.v1.layers.visual_embedding.timestep_embedding(t: torch.Tensor, dim: int, max_period: int = 10000, dtype: torch.dtype = torch.float32) → torch.Tensor [source]#
Create sinusoidal timestep embeddings.
- Parameters:
t – Tensor of shape [B] with timesteps
dim – Embedding dimension
max_period – Controls the minimum frequency of the embeddings
- Returns:
Tensor of shape [B, dim] with embeddings
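The function documents the standard sinusoidal embedding used in transformer and diffusion models. The sketch below is a reference formulation of that scheme, not the library's exact implementation; the cos/sin ordering and odd-dimension padding are assumptions.

```python
import math
import torch

def sinusoidal_embedding_sketch(t: torch.Tensor, dim: int, max_period: int = 10000) -> torch.Tensor:
    # Geometric sequence of frequencies from 1 down to 1/max_period.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]               # [B, dim // 2]
    emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
    if dim % 2 == 1:                                          # pad one zero column if dim is odd
        emb = torch.cat([emb, torch.zeros_like(emb[:, :1])], dim=-1)
    return emb                                                # [B, dim]
```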
- fastvideo.v1.layers.visual_embedding.unpatchify(x, t, h, w, patch_size, channels) → torch.Tensor [source]#
Convert patched representation back to image space.
- Parameters:
x – Tensor of shape [B, T*H*W, C*P_t*P_h*P_w]
t, h, w – Temporal and spatial dimensions of the patched representation
- Returns:
Unpatchified tensor of shape [B, C, T*P_t, H*P_h, W*P_w]
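A sketch of the reshape implied by the documented shapes, assuming tokens are ordered (t, h, w), the feature axis is ordered (C, P_t, P_h, P_w), and patch_size is a (P_t, P_h, P_w) tuple; the library's actual layout may differ.

```python
import torch

def unpatchify_sketch(x: torch.Tensor, t: int, h: int, w: int,
                      patch_size: tuple, channels: int) -> torch.Tensor:
    # [B, T*H*W, C*P_t*P_h*P_w] -> [B, C, T*P_t, H*P_h, W*P_w] (assumed layout).
    pt, ph, pw = patch_size
    b = x.shape[0]
    x = x.reshape(b, t, h, w, channels, pt, ph, pw)
    # Interleave each temporal/spatial axis with its patch axis, channels first.
    x = x.permute(0, 4, 1, 5, 2, 6, 3, 7)
    return x.reshape(b, channels, t * pt, h * ph, w * pw)
```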