Diffusers documentation

MochiTransformer3DModel


A Diffusion Transformer model for 3D video-like data was introduced in Mochi-1 Preview by Genmo.

The model can be loaded with the following code snippet.

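import torch
from diffusers import MochiTransformer3DModel

# Load only the transformer weights in half precision, then move the model to the GPU.
transformer = MochiTransformer3DModel.from_pretrained("genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.float16).to("cuda")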

MochiTransformer3DModel

class diffusers.MochiTransformer3DModel

( patch_size: int = 2, num_attention_heads: int = 24, attention_head_dim: int = 128, num_layers: int = 48, pooled_projection_dim: int = 1536, in_channels: int = 12, out_channels: typing.Optional[int] = None, qk_norm: str = 'rms_norm', text_embed_dim: int = 4096, time_embed_dim: int = 256, activation_fn: str = 'swiglu', max_sequence_length: int = 256 )

Parameters

  • patch_size (int, defaults to 2) — The size of the patches to use in the patch embedding layer.
  • num_attention_heads (int, defaults to 24) — The number of heads to use for multi-head attention.
  • attention_head_dim (int, defaults to 128) — The number of channels in each head.
  • num_layers (int, defaults to 48) — The number of layers of Transformer blocks to use.
  • in_channels (int, defaults to 12) — The number of channels in the input.
  • out_channels (int, optional, defaults to None) — The number of channels in the output.
  • qk_norm (str, defaults to "rms_norm") — The normalization layer to apply to the attention queries and keys.
  • text_embed_dim (int, defaults to 4096) — Input dimension of text embeddings from the text encoder.
  • time_embed_dim (int, defaults to 256) — Output dimension of timestep embeddings.
  • activation_fn (str, defaults to "swiglu") — Activation function to use in feed-forward.
  • max_sequence_length (int, defaults to 256) — The maximum sequence length of text embeddings supported.

A Transformer model for video-like data introduced in Mochi.
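
To make the default values above more concrete, here is a minimal arithmetic sketch in plain Python. It does not call the diffusers library. It assumes that the hidden width equals num_attention_heads × attention_head_dim (the usual multi-head layout) and that each latent frame is split into non-overlapping patch_size × patch_size patches. Both assumptions, the helper names hidden_width and video_token_count, and the example latent size are illustrative and not taken from this page.

```python
# Default constructor values copied from the signature above.
config = {
    "patch_size": 2,
    "num_attention_heads": 24,
    "attention_head_dim": 128,
    "num_layers": 48,
    "in_channels": 12,
}


def hidden_width(cfg):
    """Hidden width, ASSUMING width = heads * channels per head."""
    return cfg["num_attention_heads"] * cfg["attention_head_dim"]


def video_token_count(cfg, frames, height, width):
    """Number of video tokens for a latent of the given size.

    ASSUMPTION: every latent frame is cut into non-overlapping
    patch_size x patch_size patches, one token per patch.
    """
    p = cfg["patch_size"]
    if height % p != 0 or width % p != 0:
        raise ValueError("height and width must be divisible by patch_size")
    return frames * (height // p) * (width // p)


# Hypothetical latent size, chosen only for illustration.
print(hidden_width(config))                                        # 3072
print(video_token_count(config, frames=7, height=60, width=106))   # 11130
```

Under these assumptions, halving patch_size would quadruple the token count per frame, which is why the patch size has a large effect on attention cost.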

Transformer2DModelOutput

class diffusers.models.modeling_outputs.Transformer2DModelOutput

( sample: torch.Tensor )

Parameters

  • sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output, conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
