visual_embedding
Classes
fastvideo.layers.visual_embedding.ModulateProjection
ModulateProjection(hidden_size: int, factor: int = 2, act_layer: str = 'silu', dtype: dtype | None = None, prefix: str = '')
Bases: Module
Modulation layer for DiT blocks.
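As a rough illustration of what a DiT-style modulation layer typically does, the sketch below applies an activation to a conditioning vector and projects it to `factor * hidden_size` values, which are then split into shift and scale terms. This is a NumPy stand-in and not the fastvideo implementation: the names `modulate_projection_sketch` and `apply_shift_scale` are hypothetical, and the order of operations (SiLU, then a linear projection) is an assumption inferred from the constructor arguments.

```python
import numpy as np


def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def modulate_projection_sketch(cond, weight, bias):
    """Assumed behaviour: activation followed by a linear projection.

    cond:   [B, hidden_size] conditioning vector (e.g. a timestep embedding)
    weight: [factor * hidden_size, hidden_size]
    bias:   [factor * hidden_size]
    """
    return silu(cond) @ weight.T + bias


def apply_shift_scale(x, shift, scale):
    # Common adaLN-style modulation of tokens x: [B, N, hidden_size]
    return x * (1.0 + scale[:, None, :]) + shift[:, None, :]


rng = np.random.default_rng(0)
hidden_size, factor = 4, 2
cond = rng.normal(size=(2, hidden_size))
weight = rng.normal(size=(factor * hidden_size, hidden_size))
bias = np.zeros(factor * hidden_size)

params = modulate_projection_sketch(cond, weight, bias)
shift, scale = np.split(params, factor, axis=-1)  # two chunks when factor == 2
tokens = rng.normal(size=(2, 5, hidden_size))
modulated = apply_shift_scale(tokens, shift, scale)
```

With `factor=2` the projection yields one shift and one scale per hidden channel; larger factors would give additional chunks, for example gating terms.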
Source code in fastvideo/layers/visual_embedding.py
fastvideo.layers.visual_embedding.PatchEmbed
PatchEmbed(patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, bias=True, dtype=None, prefix: str = '')
Bases: Module
2D image to patch embedding using Conv2d.
A convolution-based approach to patchifying a 2D image with an embedding projection.
Based on the impl in https://github.com/google-research/vision_transformer
Hacked together by / Copyright 2020 Ross Wightman
The _assert call in the forward function has been removed for compatibility with multi-resolution images.
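The convolution-based patchify described above can be sketched as follows. A convolution whose kernel size equals its stride is equivalent to cutting the image into non-overlapping patches and applying one linear projection to each flattened patch. The code is a NumPy stand-in with a hypothetical name (`patch_embed_sketch`); it assumes a square patch, no normalization layer, and the `flatten=True` output layout of `[B, num_patches, embed_dim]`.

```python
import numpy as np


def patch_embed_sketch(img, weight, bias, patch_size):
    """Patchify + project, equivalent to a conv with kernel == stride.

    img:    [B, C, H, W]
    weight: [embed_dim, C, patch_size, patch_size] (conv kernel layout)
    bias:   [embed_dim]
    returns [B, (H // p) * (W // p), embed_dim]
    """
    b, c, h, w = img.shape
    p = patch_size
    gh, gw = h // p, w // p
    # Keep only whole patches; an unpadded strided conv drops the remainder too.
    img = img[:, :, : gh * p, : gw * p]
    # [B, C, gh, p, gw, p] -> [B, gh, gw, C, p, p] -> [B, gh * gw, C * p * p]
    patches = img.reshape(b, c, gh, p, gw, p).transpose(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(b, gh * gw, c * p * p)
    # Flatten each kernel to a row so the projection is a single matmul.
    return patches @ weight.reshape(weight.shape[0], -1).T + bias


rng = np.random.default_rng(0)
img = rng.normal(size=(2, 3, 8, 12))
weight = rng.normal(size=(5, 3, 4, 4))
bias = rng.normal(size=5)
tokens = patch_embed_sketch(img, weight, bias, patch_size=4)  # shape (2, 6, 5)
```

Because nothing here fixes the image size, inputs of different resolutions simply produce a different number of tokens.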
fastvideo.layers.visual_embedding.TimestepEmbedder
TimestepEmbedder(hidden_size, act_layer='silu', frequency_embedding_size=256, max_period=10000, dtype=None, freq_dtype=float32, prefix: str = '')
Bases: Module
Embeds scalar timesteps into vector representations.
Functions
fastvideo.layers.visual_embedding.get_timestep_embedding
get_timestep_embedding(timesteps: Tensor, embedding_dim: int, flip_sin_to_cos: bool = False, downscale_freq_shift: float = 1, scale: float = 1, max_period: int = 10000) -> Tensor
Create sinusoidal timestep embeddings. This matches the implementation in Denoising Diffusion Probabilistic Models.
Args
timesteps (torch.Tensor):
a 1-D Tensor of N indices, one per batch element. These may be fractional.
embedding_dim (int):
the dimension of the output.
flip_sin_to_cos (bool):
Whether the embedding order should be cos, sin (if True) or sin, cos (if False)
downscale_freq_shift (float):
Controls the frequency delta between adjacent dimensions.
scale (float):
Scaling factor applied to the embeddings.
max_period (int):
Controls the minimum frequency of the embeddings (the longest period).
Returns
torch.Tensor: an [N x dim] Tensor of positional embeddings.
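To make the roles of these arguments concrete, here is a minimal NumPy sketch of a sinusoidal embedding with the same parameters. It follows the widely used DDPM-style formulation and is an assumption about the behaviour, not a copy of the fastvideo source; the function name `sinusoidal_embedding` is hypothetical.

```python
import math

import numpy as np


def sinusoidal_embedding(timesteps, embedding_dim, flip_sin_to_cos=False,
                         downscale_freq_shift=1.0, scale=1.0, max_period=10000):
    """Return an [N, embedding_dim] array of sinusoidal embeddings."""
    half_dim = embedding_dim // 2
    # Geometric frequency ladder from 1 down towards 1 / max_period.
    # Note: half_dim must differ from downscale_freq_shift to avoid 0 / 0.
    exponent = -math.log(max_period) * np.arange(half_dim, dtype=np.float64)
    exponent = exponent / (half_dim - downscale_freq_shift)
    freqs = np.exp(exponent)

    t = np.asarray(timesteps, dtype=np.float64)
    args = scale * t[:, None] * freqs[None, :]

    # Default layout is [sin | cos]; flipping swaps the halves to [cos | sin].
    emb = np.concatenate([np.sin(args), np.cos(args)], axis=-1)
    if flip_sin_to_cos:
        emb = np.concatenate([emb[:, half_dim:], emb[:, :half_dim]], axis=-1)

    # Zero-pad one column when the requested dimension is odd.
    if embedding_dim % 2 == 1:
        emb = np.pad(emb, ((0, 0), (0, 1)))
    return emb


emb = sinusoidal_embedding(np.array([0.0, 1.5]), 8)
flipped = sinusoidal_embedding(np.array([0.0, 1.5]), 8, flip_sin_to_cos=True)
```

At timestep 0 the sine half is all zeros and the cosine half is all ones, which gives a quick check of which layout is in use.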
fastvideo.layers.visual_embedding.timestep_embedding
Create sinusoidal timestep embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| t | Tensor | Tensor of shape [B] with timesteps | required |
| dim | int | Embedding dimension | required |
| max_period | int | Controls the minimum frequency of the embeddings | 10000 |
Returns:
| Type | Description |
|---|---|
| Tensor | Tensor of shape [B, dim] with embeddings |
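The following NumPy sketch shows one common way to compute such an embedding and why `max_period` sets the lowest frequency. It assumes the DiT/ADM-style layout (cosine half first, then sine) and uses a hypothetical name, `timestep_embedding_sketch`; the actual fastvideo function may differ in ordering or dtype handling.

```python
import math

import numpy as np


def timestep_embedding_sketch(t, dim, max_period=10000):
    """Map timesteps of shape [B] to embeddings of shape [B, dim]."""
    half = dim // 2
    # Frequencies decay geometrically from 1 towards 1 / max_period,
    # so a larger max_period means slower-varying (lower-frequency) channels.
    freqs = np.exp(-math.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    emb = np.concatenate([np.cos(args), np.sin(args)], axis=-1)
    if dim % 2 == 1:
        # Pad with a zero column so odd dims are honoured.
        emb = np.pad(emb, ((0, 0), (0, 1)))
    return emb


emb = timestep_embedding_sketch(np.array([0.0, 10.0, 999.0]), 16)
```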
fastvideo.layers.visual_embedding.unpatchify
Convert patched representation back to image space.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | Tensor of shape `[B, T*H*W, C*P_t*P_h*P_w]` | required |
| t, h, w | | Temporal and spatial dimensions | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Unpatchified tensor of shape `[B, C, T*P_t, H*P_h, W*P_w]` |
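The shape transformation above can be sketched with a reshape, an axis permutation, and a second reshape. This is a NumPy illustration under assumptions: the function name `unpatchify_sketch` and its `patch_size` and `channels` arguments are hypothetical, and the channel-major ordering inside each token (`C`, then `P_t`, `P_h`, `P_w`) is inferred from the documented input shape.

```python
import numpy as np


def unpatchify_sketch(x, t, h, w, patch_size, channels):
    """[B, T*H*W, C*P_t*P_h*P_w] -> [B, C, T*P_t, H*P_h, W*P_w]."""
    pt, ph, pw = patch_size
    b = x.shape[0]
    # Split the token axis into the (t, h, w) grid and the feature axis
    # into (channels, pt, ph, pw).
    x = x.reshape(b, t, h, w, channels, pt, ph, pw)
    # Move channels forward and interleave each grid axis with its patch axis.
    x = x.transpose(0, 4, 1, 5, 2, 6, 3, 7)
    return x.reshape(b, channels, t * pt, h * ph, w * pw)


tokens = np.arange(1 * 12 * 8).reshape(1, 12, 8)  # B=1, T*H*W=2*2*3, C*P=2*1*2*2
video = unpatchify_sketch(tokens, t=2, h=2, w=3, patch_size=(1, 2, 2), channels=2)
```

Each output voxel at `[c, ti*P_t + o, hi*P_h + p, wi*P_w + q]` comes from token `(ti*h + hi)*w + wi` and feature index `((c*P_t + o)*P_h + p)*P_w + q`.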