autocast.encoders.dc

class DCEncoder(in_channels, out_channels, hid_channels=(64, 128, 256), hid_blocks=(3, 3, 3), kernel_size=3, stride=2, pixel_shuffle=True, norm='layer', attention_heads=None, ffn_factor=1, spatial=2, patch_size=1, periodic=False, dropout=None, checkpointing=False, identity_init=True, ffn_out_scale=None, saturation=None, saturation_scale=5.0)

Bases: EncoderWithCond

Deep Compressed (DC) encoder module.

Progressively downsamples the input to a latent representation using residual blocks with optional attention.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output (latent) channels.

  • hid_channels (Sequence[int]) – Number of channels at each depth level.

  • hid_blocks (Sequence[int]) – Number of residual blocks at each depth level.

  • kernel_size (int | Sequence[int]) – Kernel size for convolutions.

  • stride (int | Sequence[int]) – Stride for downsampling operations.

  • pixel_shuffle (bool) – Whether to use pixel shuffling (patchify) for downsampling.

  • norm (str) – Type of normalization (‘layer’ or ‘group’).

  • attention_heads (dict[int, int] | None) – Dict mapping depth index to number of attention heads.

  • ffn_factor (int) – Channel expansion factor in FFN blocks.

  • spatial (int) – Number of spatial dimensions (2 for 2D, 3 for 3D).

  • patch_size (int | Sequence[int]) – Patch size for patchifying at the start.

  • periodic (bool) – Whether spatial dimensions are periodic (use circular padding).

  • dropout (float | None) – Dropout rate.

  • checkpointing (bool) – Whether to use gradient checkpointing.

  • identity_init (bool) – Initialize down/upsampling convolutions as identity.

  • ffn_out_scale (float | None) – Optional multiplicative scale applied to each ResBlock FFN output conv.

  • saturation (str | None) – Optional latent saturation mode. Supported: {“softclip2”, “softclip”, “tanh”, “arcsinh”, “rmsnorm”}.

  • saturation_scale (float) – Saturation scale B used by soft clipping/tanh variants.
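
The saturation modes listed above are named but not defined on this page. As a minimal sketch, assuming the common definitions of these functions, each mode squashes latent values smoothly, with saturation_scale playing the role of the bound B. The formulas are assumptions and are not confirmed to match the library, and saturate is a hypothetical helper written only for illustration:

```python
import math


def saturate(values, mode=None, scale=5.0, eps=1e-6):
    """Hypothetical helper: squash a vector of latent channel values.

    The formulas are common definitions assumed for illustration; the
    library's actual implementation may differ.
    """
    if mode is None:
        return list(values)
    if mode == "softclip":  # x / (1 + |x| / B), magnitude bounded by B
        return [x / (1.0 + abs(x) / scale) for x in values]
    if mode == "softclip2":  # x / sqrt(1 + (x / B)^2), magnitude bounded by B
        return [x / math.sqrt(1.0 + (x / scale) ** 2) for x in values]
    if mode == "tanh":  # B * tanh(x / B), magnitude bounded by B
        return [scale * math.tanh(x / scale) for x in values]
    if mode == "arcsinh":  # logarithmic growth, unbounded
        return [math.asinh(x) for x in values]
    if mode == "rmsnorm":  # rescale the vector to unit root-mean-square
        rms = math.sqrt(sum(x * x for x in values) / len(values) + eps)
        return [x / rms for x in values]
    raise ValueError(f"unknown saturation mode: {mode!r}")


latent = [-100.0, -1.0, 0.0, 0.5, 100.0]
for mode in ("softclip", "softclip2", "tanh"):
    out = saturate(latent, mode, scale=5.0)
    # Large inputs approach +/-B while small inputs stay close to themselves.
    print(mode, [round(v, 3) for v in out])
```

Under these assumed definitions, a larger saturation_scale leaves more of the latent range nearly linear before the squashing takes effect.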

Note

Based on the implementations from:

  • Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models (Chen et al., 2024), https://arxiv.org/abs/2410.10733v1

  • Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation (Rozet et al., 2025), https://arxiv.org/abs/2507.02608, PolymathicAI/lola
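
The pixel_shuffle option refers to patchify-style downsampling, which trades spatial resolution for channels instead of discarding values. A minimal pure-Python sketch for a single 2D channels-last sample follows. space_to_depth is a hypothetical helper, and the channel ordering inside each patch is an assumption that may not match the library:

```python
def space_to_depth(image, factor=2):
    """Fold each factor x factor patch of an (H, W, C) nested list into channels.

    Returns an (H // factor, W // factor, C * factor**2) nested list.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("spatial sizes must be divisible by the factor")
    out = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            pixel = []
            # Concatenate the channels of every position in the patch.
            for di in range(factor):
                for dj in range(factor):
                    pixel.extend(image[i + di][j + dj])
            row.append(pixel)
        out.append(row)
    return out


# A 4 x 4 image with one channel, values 0..15 in row-major order.
image = [[[4 * i + j] for j in range(4)] for i in range(4)]
folded = space_to_depth(image, factor=2)
print(len(folded), len(folded[0]), len(folded[0][0]))  # 2 2 4
print(folded[0][0])  # [0, 1, 4, 5]
```

Every input value survives the rearrangement, so the operation is lossless and can be inverted by the matching upsampling step.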

channel_axis: int = -1
encoder_model: Module
encode(batch)

Encode input batch to latent representation.

Parameters:

batch (Batch) – Input batch containing input_fields with shape (B, T, spatial…, C_i).

Returns:

Encoded latent tensor with shape (B, T, spatial_reduced…, C_o).

Return type:

Float[Tensor, 'batch time spatial *spatial channel']
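
To make the shape contract concrete, the sketch below estimates the latent shape from the input shape. It rests on assumptions not stated on this page: one initial patchify by patch_size, then one downsampling by stride between consecutive depth levels (so len(hid_channels) - 1 steps in total), with every spatial size divisible by the total factor. latent_shape is a hypothetical helper, not part of the library:

```python
def latent_shape(input_shape, out_channels, hid_channels=(64, 128, 256),
                 stride=2, patch_size=1):
    """Estimate (B, T, spatial_reduced..., C_o) from (B, T, spatial..., C_i).

    Assumes scalar stride and patch_size, and one downsampling step between
    consecutive depth levels; the real module may differ.
    """
    batch, time, *spatial, _in_channels = input_shape
    factor = patch_size * stride ** (len(hid_channels) - 1)
    for size in spatial:
        if size % factor:
            raise ValueError(f"spatial size {size} not divisible by {factor}")
    reduced = [size // factor for size in spatial]
    return (batch, time, *reduced, out_channels)


# Default three-level encoder on a 2D field: total reduction of 2**2 = 4 per axis.
print(latent_shape((8, 4, 64, 64, 3), out_channels=16))  # (8, 4, 16, 16, 16)
```

The batch and time axes pass through unchanged, and only the spatial axes and the trailing channel axis differ between input and output.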

encode_tensor(x)

Forward pass through encoder (for direct tensor input).

Parameters:

x (Float[Tensor, 'batch time spatial *spatial channel']) – Input tensor with shape (B, T, spatial..., C_i).

Returns:

Encoded latent tensor.

Return type:

Float[Tensor, 'batch time spatial *spatial channel']