ltx2_audio_vae
¶
Native LTX-2 Audio VAE and Vocoder implementation for FastVideo.
Classes¶
fastvideo.models.audio.ltx2_audio_vae.AttentionType
¶
fastvideo.models.audio.ltx2_audio_vae.AttnBlock
¶
Bases: Module
Vanilla self-attention block for 2D features.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.AudioDecoder
¶
AudioDecoder(*, ch: int, out_ch: int, ch_mult: Tuple[int, ...] = (1, 2, 4, 8), num_res_blocks: int, attn_resolutions: Set[int], resolution: int, z_channels: int, norm_type: NormType = GROUP, causality_axis: CausalityAxis = WIDTH, dropout: float = 0.0, mid_block_add_attention: bool = True, sample_rate: int = 16000, mel_hop_length: int = 160, is_causal: bool = True, mel_bins: int | None = None)
Bases: Module
Symmetric decoder that reconstructs audio spectrograms from latent features.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
Functions¶
fastvideo.models.audio.ltx2_audio_vae.AudioDecoder.forward
¶
Decode latent features back to audio spectrograms.
Args:
    sample: Encoded latent representation of shape (batch, channels, frames, mel_bins)
Returns:
    Reconstructed audio spectrogram of shape (batch, channels, time, frequency)
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.AudioDecoderConfigurator
¶
Factory for AudioDecoder from checkpoint config.
fastvideo.models.audio.ltx2_audio_vae.AudioEncoder
¶
AudioEncoder(*, ch: int, ch_mult: Tuple[int, ...] = (1, 2, 4, 8), num_res_blocks: int, attn_resolutions: Set[int], dropout: float = 0.0, resamp_with_conv: bool = True, in_channels: int, resolution: int, z_channels: int, double_z: bool = True, attn_type: AttentionType = VANILLA, mid_block_add_attention: bool = True, norm_type: NormType = GROUP, causality_axis: CausalityAxis = WIDTH, sample_rate: int = 16000, mel_hop_length: int = 160, n_fft: int = 1024, is_causal: bool = True, mel_bins: int = 64, **_ignore_kwargs)
Bases: Module
Encoder that compresses audio spectrograms into latent representations.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
Functions¶
fastvideo.models.audio.ltx2_audio_vae.AudioEncoder.forward
¶
Encode audio spectrogram into latent representations.
Args:
    spectrogram: Input spectrogram of shape (batch, channels, time, frequency)
Returns:
    Encoded latent representation of shape (batch, channels, frames, mel_bins)
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.AudioEncoderConfigurator
¶
Factory for AudioEncoder from checkpoint config.
fastvideo.models.audio.ltx2_audio_vae.AudioLatentShape
¶
fastvideo.models.audio.ltx2_audio_vae.AudioPatchifier
¶
AudioPatchifier(patch_size: int = 1, sample_rate: int = 16000, hop_length: int = 160, audio_latent_downsample_factor: int = 4, is_causal: bool = True)
Simple patchifier for audio latents.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
Functions¶
fastvideo.models.audio.ltx2_audio_vae.AudioPatchifier.patchify
¶
Flatten audio latent tensor along time: (B, C, T, F) -> (B, T, C*F).
fastvideo.models.audio.ltx2_audio_vae.AudioPatchifier.unpatchify
¶
unpatchify(audio_latents: Tensor, output_shape: AudioLatentShape) -> Tensor
Restore (B, C, T, F) from flattened patches: (B, T, C*F) -> (B, C, T, F).
Source code in fastvideo/models/audio/ltx2_audio_vae.py
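The patchify/unpatchify pair is a pure layout transform. A minimal numpy sketch of the equivalent reshape (a stand-in for illustration, not the FastVideo implementation, which operates on torch tensors):

```python
import numpy as np

def patchify(latents: np.ndarray) -> np.ndarray:
    """(B, C, T, F) -> (B, T, C*F): move time to axis 1, flatten channel/freq."""
    b, c, t, f = latents.shape
    return latents.transpose(0, 2, 1, 3).reshape(b, t, c * f)

def unpatchify(tokens: np.ndarray, c: int, f: int) -> np.ndarray:
    """(B, T, C*F) -> (B, C, T, F): exact inverse of patchify."""
    b, t, _ = tokens.shape
    return tokens.reshape(b, t, c, f).transpose(0, 2, 1, 3)

x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
assert patchify(x).shape == (2, 4, 15)
assert np.array_equal(unpatchify(patchify(x), 3, 5), x)  # lossless round trip
```

Because the transform is a pure permutation of elements, the round trip is exact, which is why `unpatchify` only needs the original `AudioLatentShape` to restore the tensor.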
fastvideo.models.audio.ltx2_audio_vae.CausalConv2d
¶
CausalConv2d(in_channels: int, out_channels: int, kernel_size: int | Tuple[int, int], stride: int = 1, dilation: int | Tuple[int, int] = 1, groups: int = 1, bias: bool = True, causality_axis: CausalityAxis = HEIGHT)
Bases: Module
A causal 2D convolution. Ensures output at time t only depends on inputs at time t and earlier.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
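The causality guarantee comes from one-sided padding: all `(kernel_size - 1) * dilation` padding goes before the causal axis, so no tap reaches into the future. A 1D numpy sketch of the idea (illustrative, not the module's 2D torch code):

```python
import numpy as np

def causal_conv1d(x: np.ndarray, kernel: np.ndarray, dilation: int = 1) -> np.ndarray:
    """Left-pad by (k-1)*dilation so y[t] depends only on x[0..t]."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.pad(x, (pad, 0))  # zeros before the signal, nothing after
    # dilated taps: y[t] = sum_i kernel[i] * xp[t + i*dilation]
    return np.array([
        sum(kernel[i] * xp[t + i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.random.default_rng(0).standard_normal(16)
w = np.array([0.5, -0.25, 0.125])
y1 = causal_conv1d(x, w)
x2 = x.copy()
x2[10] += 100.0             # perturb a "future" sample
y2 = causal_conv1d(x2, w)
assert np.allclose(y1[:10], y2[:10])  # outputs before t=10 are unchanged
```

The final assertion is the causality property itself: changing the input at t=10 cannot alter any output before t=10.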
fastvideo.models.audio.ltx2_audio_vae.CausalityAxis
¶
fastvideo.models.audio.ltx2_audio_vae.Downsample
¶
Downsample(in_channels: int, with_conv: bool, causality_axis: CausalityAxis = WIDTH)
Bases: Module
Downsampling layer with strided convolution or average pooling.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.LTX2AudioDecoder
¶
fastvideo.models.audio.ltx2_audio_vae.LTX2AudioEncoder
¶
fastvideo.models.audio.ltx2_audio_vae.LTX2Vocoder
¶
fastvideo.models.audio.ltx2_audio_vae.NormType
¶
fastvideo.models.audio.ltx2_audio_vae.PerChannelStatistics
¶
PerChannelStatistics(latent_channels: int = 128)
Bases: Module
Per-channel statistics for normalizing and denormalizing the latent representation.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
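Per-channel normalization whitens each of the latent channels (128 by default) with its own mean and standard deviation, and denormalization inverts it exactly. A hypothetical numpy stand-in for the transform (the real module stores the statistics as buffers on the torch module):

```python
import numpy as np

def normalize(latent: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Whiten each channel: (x - mean_c) / std_c, broadcast over (B, C, T, F)."""
    return (latent - mean[None, :, None, None]) / std[None, :, None, None]

def denormalize(latent: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Inverse transform back to the raw latent scale."""
    return latent * std[None, :, None, None] + mean[None, :, None, None]

rng = np.random.default_rng(0)
z = rng.standard_normal((1, 128, 8, 4))      # (B, C, T, F) latent
mean = rng.standard_normal(128)               # one statistic per channel
std = rng.uniform(0.5, 2.0, 128)
assert np.allclose(denormalize(normalize(z, mean, std), mean, std), z)
```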
fastvideo.models.audio.ltx2_audio_vae.PixelNorm
¶
fastvideo.models.audio.ltx2_audio_vae.ResBlock1
¶
Bases: Module
1D ResBlock for vocoder with dilated convolutions.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.ResBlock2
¶
Bases: Module
1D ResBlock for vocoder (simpler version).
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.ResnetBlock
¶
ResnetBlock(*, in_channels: int, out_channels: int | None = None, conv_shortcut: bool = False, dropout: float = 0.0, temb_channels: int = 512, norm_type: NormType = GROUP, causality_axis: CausalityAxis = HEIGHT)
Bases: Module
2D ResNet block for audio VAE.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.Upsample
¶
Upsample(in_channels: int, with_conv: bool, causality_axis: CausalityAxis = HEIGHT)
Bases: Module
Upsampling layer with nearest-neighbor interpolation and optional convolution.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.Vocoder
¶
Vocoder(resblock_kernel_sizes: List[int] | None = None, upsample_rates: List[int] | None = None, upsample_kernel_sizes: List[int] | None = None, resblock_dilation_sizes: List[List[int]] | None = None, upsample_initial_channel: int = 1024, stereo: bool = True, resblock: str = '1', output_sample_rate: int = 24000)
Bases: Module
Vocoder model for synthesizing audio from Mel spectrograms.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
Functions¶
fastvideo.models.audio.ltx2_audio_vae.Vocoder.forward
¶
Forward pass of the vocoder.
Args:
    x: Input Mel spectrogram tensor of shape (batch, channels, time, mel_bins)
Returns:
    Audio waveform tensor of shape (batch, out_channels, audio_length)
Source code in fastvideo/models/audio/ltx2_audio_vae.py
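In upsampling vocoders of this kind, each transposed-convolution stage multiplies the time dimension by one entry of `upsample_rates`, so the waveform length is roughly the frame count times their product. The rates below are placeholders for illustration (the signature defaults them to `None`, so the real values come from the checkpoint config):

```python
from math import prod

def vocoder_output_length(num_frames: int, upsample_rates: list[int]) -> int:
    """Waveform samples produced for num_frames mel frames by chained upsampling."""
    return num_frames * prod(upsample_rates)

# Hypothetical rates whose product is 240: 100 frames -> 24000 samples,
# i.e. one second at the default output_sample_rate of 24000.
rates = [8, 6, 5]  # placeholder values, not the checkpoint defaults
print(vocoder_output_length(100, rates))  # 24000
```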
fastvideo.models.audio.ltx2_audio_vae.VocoderConfigurator
¶
Factory for Vocoder from checkpoint config.
Functions¶
fastvideo.models.audio.ltx2_audio_vae.build_downsampling_path
¶
build_downsampling_path(*, ch: int, ch_mult: Tuple[int, ...], num_resolutions: int, num_res_blocks: int, resolution: int, temb_channels: int, dropout: float, norm_type: NormType, causality_axis: CausalityAxis, attn_type: AttentionType, attn_resolutions: Set[int], resamp_with_conv: bool) -> Tuple[ModuleList, int]
Build the downsampling path with residual blocks, attention, and downsampling layers.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.build_mid_block
¶
build_mid_block(channels: int, temb_channels: int, dropout: float, norm_type: NormType, causality_axis: CausalityAxis, attn_type: AttentionType, add_attention: bool) -> Module
Build the middle block with two ResNet blocks and optional attention.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.build_normalization_layer
¶
build_normalization_layer(in_channels: int, *, num_groups: int = 32, normtype: NormType = GROUP) -> Module
Create a normalization layer based on the normalization type.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.build_upsampling_path
¶
build_upsampling_path(*, ch: int, ch_mult: Tuple[int, ...], num_resolutions: int, num_res_blocks: int, resolution: int, temb_channels: int, dropout: float, norm_type: NormType, causality_axis: CausalityAxis, attn_type: AttentionType, attn_resolutions: Set[int], resamp_with_conv: bool, initial_block_channels: int) -> Tuple[ModuleList, int]
Build the upsampling path with residual blocks, attention, and upsampling layers.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.decode_audio
¶
decode_audio(latent: Tensor, audio_decoder: AudioDecoder, vocoder: Vocoder) -> Tensor
Decode an audio latent representation using the provided audio decoder and vocoder.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
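Some bookkeeping helps when sizing latents for decoding. With the documented defaults (sample_rate=16000, mel_hop_length=160) the mel frame rate is 100 Hz, and with the AudioPatchifier's default `audio_latent_downsample_factor=4` each latent frame spans 4 mel hops, giving 25 latent frames per second:

```python
def latent_frames_per_second(sample_rate: int = 16000,
                             hop_length: int = 160,
                             latent_downsample: int = 4) -> float:
    """Mel frame rate divided by the temporal downsampling of the latent."""
    mel_frames_per_second = sample_rate / hop_length   # 16000 / 160 = 100 Hz
    return mel_frames_per_second / latent_downsample   # 100 / 4 = 25 Hz

print(latent_frames_per_second())  # 25.0
```

So a 10-second clip corresponds to roughly 250 latent frames on the time axis before patchification.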
fastvideo.models.audio.ltx2_audio_vae.make_attn
¶
make_attn(in_channels: int, attn_type: AttentionType = VANILLA, norm_type: NormType = GROUP) -> Module
Factory function for attention blocks.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
fastvideo.models.audio.ltx2_audio_vae.make_conv2d
¶
make_conv2d(in_channels: int, out_channels: int, kernel_size: int | Tuple[int, int], stride: int = 1, padding: Tuple[int, int, int, int] | None = None, dilation: int = 1, groups: int = 1, bias: bool = True, causality_axis: CausalityAxis | None = None) -> Module
Create a 2D convolution layer that can be either causal or non-causal.
Source code in fastvideo/models/audio/ltx2_audio_vae.py
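The 4-tuple `padding` parameter suggests the (left, right, top, bottom) ordering used by `torch.nn.functional.pad` for 2D inputs. A sketch of how such a tuple makes a same-size convolution causal along one axis, under the assumption that "past" is the leading side of the causal axis (illustrative convention, not the exact FastVideo helper):

```python
def causal_padding(kh: int, kw: int, causal_axis: str) -> tuple[int, int, int, int]:
    """(left, right, top, bottom) pads for a same-size conv, causal on one axis.

    All (k - 1) padding goes before the causal axis; the other axis is symmetric.
    Assumes odd kernel size on the non-causal axis.
    """
    if causal_axis == "height":
        return (kw // 2, kw // 2, kh - 1, 0)   # pad fully on the past side
    if causal_axis == "width":
        return (kw - 1, 0, kh // 2, kh // 2)
    raise ValueError(f"unknown causal_axis: {causal_axis}")

print(causal_padding(3, 3, "width"))  # (2, 0, 1, 1)
```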
fastvideo.models.audio.ltx2_audio_vae.run_mid_block
¶
Run features through the middle block.