Emu3
Overview
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. Discretized image tokens are later fused with text token ids for image and text generation. The model can additionally generate images by predicting image token ids.
The abstract from the paper is the following:
While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
Tips:
We advise users to set
processor.tokenizer.padding_side = "left"
before batched generation as it leads to more accurate results.Note that the model has been trained with a specific prompt format for chatting. Use
processor.apply_chat_template(my_conversation_dict)
to correctly format your prompts.Emu3 has two different checkpoints for image-generation and text-generation, make sure to use the correct checkpoint when loading the model. To generate an image, it is advised to use
prefix_constraints
so that the generated tokens are sampled only from possible image tokens. See more below for usage examples.
Emu3 implementation in Transformers uses a special image token to indicate where to merge image embeddings. The special image token isnβt new and uses one of the reserved tokens: <|extra_0|>
. You have to add <image>
to your prompt in the place where the image should be embedded for correct generation.
This model was contributed by RaushanTurganbay. The original code can be found here.
Usage example
Text generation inference
Hereβs how to load the model and perform inference in half-precision (torch.bfloat16
) to generate textual output from text or text and image inputs:
from transformers import Emu3Processor, Emu3ForConditionalGeneration
import torch
from PIL import Image
import requests
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
model = Emu3ForConditionalGeneration.from_pretrained("Emu3-community/Emu3-Chat-hf", torch_dtype=torch.bfloat16, device_map="cuda")
# prepare image and text prompt
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
Image generation inference
Emu3 can also generate images from textual input. Here is how you can do it:
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Gen-hf")
model = Emu3ForConditionalGeneration.from_pretrained("Emu3-community/Emu3-Gen-hf", torch_dtype="bfloat16", device_map="auto", attn_implementation="flash_attention_2")
inputs = processor(
text=["a portrait of young girl. masterpiece, film grained, best quality.", "a dog running under the rain"],
padding=True,
return_tensors="pt",
return_for_image_generation=True,
)
inputs = inputs.to(device="cuda:0", dtype=torch.bfloat16)
neg_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
neg_inputs = processor(text=[neg_prompt] * 2, return_tensors="pt").to(device="cuda:0")
image_sizes = inputs.pop("image_sizes")
HEIGHT, WIDTH = image_sizes[0]
VISUAL_TOKENS = model.vocabulary_mapping.image_tokens
def prefix_allowed_tokens_fn(batch_id, input_ids):
height, width = HEIGHT, WIDTH
visual_tokens = VISUAL_TOKENS
image_wrapper_token_id = torch.tensor([processor.tokenizer.image_wrapper_token_id], device=model.device)
eoi_token_id = torch.tensor([processor.tokenizer.eoi_token_id], device=model.device)
eos_token_id = torch.tensor([processor.tokenizer.eos_token_id], device=model.device)
pad_token_id = torch.tensor([processor.tokenizer.pad_token_id], device=model.device)
eof_token_id = torch.tensor([processor.tokenizer.eof_token_id], device=model.device)
eol_token_id = processor.tokenizer.encode("<|extra_200|>", return_tensors="pt")[0]
position = torch.nonzero(input_ids == image_wrapper_token_id, as_tuple=True)[0][0]
offset = input_ids.shape[0] - position
if offset % (width + 1) == 0:
return (eol_token_id, )
elif offset == (width + 1) * height + 1:
return (eof_token_id, )
elif offset == (width + 1) * height + 2:
return (eoi_token_id, )
elif offset == (width + 1) * height + 3:
return (eos_token_id, )
elif offset > (width + 1) * height + 3:
return (pad_token_id, )
else:
return visual_tokens
out = model.generate(
**inputs,
max_new_tokens=50_000, # make sure to have enough tokens for one image
prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
return_dict_in_generate=True,
negative_prompt_ids=neg_inputs.input_ids, # indicate for Classifier-Free Guidance
negative_prompt_attention_mask=neg_inputs.attention_mask,
)
image = model.decode_image_tokens(out.sequences[:, inputs.input_ids.shape[1]: ], height=HEIGHT, width=WIDTH)
images = processor.postprocess(list(image.float()), return_tensors="PIL.Image.Image") # internally we convert to np but it's not supported in bf16 precision
for i, image in enumerate(images['pixel_values']):
image.save(f"result{i}.png")
Emu3Config
class transformers.Emu3Config
< source >( vq_config: typing.Union[typing.Dict, transformers.models.emu3.configuration_emu3.Emu3VQVAEConfig] = None text_config: typing.Union[typing.Dict, transformers.models.emu3.configuration_emu3.Emu3TextConfig] = None vocabulary_map: typing.Dict[int, int] = None **kwargs )
Parameters
- vq_config (
Union[Dict, Emu3VQVAEConfig]
, optional) — Emu3VQVAEConfig instance containing the configuration for the VQ-VAE model. - text_config (`Union[Dict, Emu3TextConfig]“, optional) — Emu3TextConfig instance containing the configuration for the language model.
- vocabulary_map (
dict
, optional) — A dictionary containing the vocabulary map from the tokenizer. Used to obtain tokens from the image inputs.
This is the configuration class to store the configuration of a Emu3Model
. It is used to instantiate a
emu3 model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the
Emu3-community/Emu3-Chat-hf.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Emu3VQVAEConfig
class transformers.Emu3VQVAEConfig
< source >( codebook_size: int = 32768 embed_dim: int = 4 latent_channels: int = 4 double_latent: bool = False in_channels: int = 3 out_channels: int = 3 temporal_downsample_factor: int = 4 base_channels: int = 256 channel_multiplier: typing.List[int] = [1, 2, 2, 4] num_res_blocks: int = 2 attn_resolutions: typing.List[int] = [3] hidden_size: int = 1024 num_attention_heads: int = 1 attention_dropout: float = 0.0 **kwargs )
Parameters
- codebook_size (
int
, optional, defaults to 32768) — Codebook size of the VQ model. - embed_dim (
int
, optional, defaults to 4) — Dimension of the quantized vector in codebook. - latent_channels (
int
, optional, defaults to 4) — Dimension of the output channel of encoder and the input channel of decoder - double_latent (
bool
, optional, defaults toFalse
) — Whether double the output dim of the encoder. - in_channels (
int
, optional, defaults to 3) — Input channel of encoder. - out_channels (
int
, optional, defaults to 3) — Output channel of decoder. - temporal_downsample_factor (
int
, optional, defaults to 4) — Temporal downsample factor. - base_channels (
int
, optional, defaults to 256) — Basic channel number of the intermediate blocks. - channel_multiplier (
List[int]
, optional, defaults to[1, 2, 2, 4]
) — Channel scaling factor of the intermediate blocks. - num_res_blocks (
int
, optional, defaults to 2) — Residual block number in each stage. - attn_resolutions (
List[int]
, optional, defaults to[3]
) — Stage indices to apply attention. - hidden_size (
int
, optional, defaults to 1024) — Dimension of the hidden representations in the attention layer. - num_attention_heads (
int
, optional, defaults to 1) — Number of attention heads for each attention layer. - attention_dropout (
float
, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
This is the configuration class to store the configuration of a Emu3VQVAE. It is used to instantiate an VQ-VAE model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration to the VQ model presented in Emu3 paper.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import Emu3VQVAE, Emu3VQVAEConfig
>>> # Initializing a video VQ model of Emu3 configuration
>>> configuration = Emu3VQVAEConfig()
>>> # Initializing a model from the Emu3 VQ model style configuration
>>> model = Emu3VQVAE(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Emu3TextConfig
class transformers.Emu3TextConfig
< source >( vocab_size: int = 184622 hidden_size: int = 4096 intermediate_size: int = 14336 num_hidden_layers: int = 32 num_attention_heads: int = 32 num_key_value_heads: typing.Optional[int] = 8 hidden_act: str = 'silu' max_position_embeddings: int = 9216 rms_norm_eps: float = 1e-05 use_cache: bool = True pad_token_id: int = 151643 bos_token_id: int = 151849 eos_token_id: int = 151850 tie_word_embeddings: bool = False rope_theta: float = 1000000.0 rope_scaling: typing.Optional = None mlp_bias = False attention_bias = False attention_dropout: float = 0.1 initializer_range: float = 0.02 **kwargs )
Parameters
- vocab_size (
int
, optional, defaults to 184622) — Vocabulary size of the Emu3 model. Defines the number of different tokens that can be represented by theinputs_ids
passed when callingEmu3Model
- hidden_size (
int
, optional, defaults to 4096) — Dimension of the hidden representations. - intermediate_size (
int
, optional, defaults to 14336) — Dimension of the MLP representations. - num_hidden_layers (
int
, optional, defaults to 32) — Number of hidden layers in the Transformer decoder. - num_attention_heads (
int
, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder. - num_key_value_heads (
int
, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. Ifnum_key_value_heads=num_attention_heads
, the model will use Multi Head Attention (MHA), ifnum_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details checkout [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
num_attention_heads`. - hidden_act (
str
orfunction
, optional, defaults to"silu"
) — The non-linear activation function (function or string) in the decoder. - max_position_embeddings (
int
, optional, defaults to 9216) — The maximum sequence length that this model might ever be used with. Emu supports up to 9216 tokens, - rms_norm_eps (
float
, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers. - use_cache (
bool
, optional, defaults toTrue
) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant ifconfig.is_decoder=True
. - pad_token_id (
int
, optional, defaults to 151643) — Padding token id. - bos_token_id (
int
, optional, defaults to 151849) — Beginning of stream token id. - eos_token_id (
int
, optional, defaults to 151850) — End of stream token id. - tie_word_embeddings (
bool
, optional, defaults toFalse
) — Whether to tie weight embeddings - rope_theta (
float
, optional, defaults to 1000000.0) — The base period of the RoPE embeddings. - rope_scaling (
Dict
, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type and you expect the model to work on longermax_position_embeddings
, we recommend you to update this value accordingly. Expected contents:rope_type
(str
): The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.factor
(float
, optional): Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, afactor
of x will enable the model to handle sequences of length x original maximum pre-trained length.original_max_position_embeddings
(int
, optional): Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.attention_factor
(float
, optional): Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to value recommended by the implementation, using thefactor
field to infer the suggested value.beta_fast
(float
, optional): Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.beta_slow
(float
, optional): Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.short_factor
(List[float]
, optional): Only used with ‘longrope’. The scaling factor to be applied to short contexts (<original_max_position_embeddings
). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2long_factor
(List[float]
, optional): Only used with ‘longrope’. The scaling factor to be applied to long contexts (<original_max_position_embeddings
). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2low_freq_factor
(float
, optional): Only used with ‘llama3’. Scaling factor applied to low frequency components of the RoPEhigh_freq_factor
(float
, optional*): Only used with ‘llama3’. Scaling factor applied to high frequency components of the RoPE - mlp_bias (
bool
, optional, defaults toFalse
) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers. - attention_bias (
bool
, optional, defaults toFalse
) — Whether to use a bias in the query, key, value and output projection layers during self-attention. - attention_dropout (
float
, optional, defaults to 0.1) — The dropout ratio for the attention probabilities. - initializer_range (
float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a Emu3TextModel. It is used to instantiate a emu3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Emu3-community/Emu3-Chat-hf.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import Emu3Model, Emu3Config
>>> # Initializing a Emu3-community/Emu3-Chat-hf style configuration
>>> configuration = Emu3Config()
>>> # Initializing a model from the Emu3-community/Emu3-Chat-hf style configuration
>>> model = Emu3Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Emu3Processor
class transformers.Emu3Processor
< source >( image_processor tokenizer chat_template = None **kwargs )
Parameters
- image_processor (Emu3ImageProcessor) — The image processor is a required input.
- tokenizer (
Emu3TokenizerFast
) — The tokenizer is a required input. - chat_template (
str
, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
Constructs a Emu3 processor which wraps a Emu3 image processor and a GPT2 tokenizer into a single processor.
Emu3Processor offers all the functionalities of Emu3ImageProcessor and GPT2TokenizerFast.
See the __call__()
and decode() for more information.
This method forwards all its arguments to Emu3TokenizerFastβs batch_decode(). Please refer to the docstring of this method for more information.
This method forwards all its arguments to Emu3TokenizerFastβs decode(). Please refer to the docstring of this method for more information.
Emu3ImageProcessor
class transformers.Emu3ImageProcessor
< source >( do_resize: bool = True resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_rgb: bool = True do_pad: bool = True min_pixels: int = 262144 max_pixels: int = 1048576 spatial_factor: int = 8 **kwargs )
Parameters
- do_resize (
bool
, optional, defaults toTrue
) — Whether to resize the image’s (height, width) dimensions. - resample (
PILImageResampling
, optional, defaults toResampling.BICUBIC
) — Resampling filter to use when resizing the image. - do_rescale (
bool
, optional, defaults toTrue
) — Whether to rescale the image by the specified scalerescale_factor
. - rescale_factor (
int
orfloat
, optional, defaults to1/255
) — Scale factor to use if rescaling the image. - do_normalize (
bool
, optional, defaults toTrue
) — Whether to normalize the image. - image_mean (
float
orList[float]
, optional, defaults to[0.48145466, 0.4578275, 0.40821073]
) — Mean to use if normalizing the image. This is a float or list of floats for each channel in the image. - image_std (
float
orList[float]
, optional, defaults to[0.26862954, 0.26130258, 0.27577711]
) — Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image. - do_convert_rgb (
bool
, optional, defaults toTrue
) — Whether to convert the image to RGB. - do_pad (
bool
, optional, defaults toTrue
) — Whether to pad the image. IfTrue
, will pad the patch dimension of the images in the batch to the largest number of patches in the batch. Padding will be applied to the bottom and right with zeros. - min_pixels (
int
, optional, defaults to512 * 512
) — The min pixels of the image to resize the image. - max_pixels (
int
, optional, defaults to1024 * 1024
) — The max pixels of the image to resize the image. - spatial_factor (
int
, optional, defaults to 8) — The spatial downsample factor the image will be downsampled in feature extracting phase
Constructs a Emu3 image processor that dynamically resizes images based on the original images.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] do_resize: bool = None size: typing.Dict[str, int] = None resample: Resampling = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_rgb: bool = None do_pad: bool = True return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )
Parameters
- images (
ImageInput
) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, setdo_rescale=False
. - do_resize (
bool
, optional, defaults toself.do_resize
) — Whether to resize the image. - size (
Dict[str, int]
, optional, defaults toself.size
) — Size of the image after resizing. Shortest edge of the image is resized to size[“shortest_edge”], with the longest edge resized to keep the input aspect ratio. - resample (
int
, optional, defaults toself.resample
) — Resampling filter to use if resizing the image. This can be one of the enumPILImageResampling
. Only has an effect ifdo_resize
is set toTrue
. - do_rescale (
bool
, optional, defaults toself.do_rescale
) — Whether to rescale the image. - rescale_factor (
float
, optional, defaults toself.rescale_factor
) — Rescale factor to rescale the image by ifdo_rescale
is set toTrue
. - do_normalize (
bool
, optional, defaults toself.do_normalize
) — Whether to normalize the image. - image_mean (
float
orList[float]
, optional, defaults toself.image_mean
) — Image mean to use for normalization. Only has an effect ifdo_normalize
is set toTrue
. - image_std (
float
orList[float]
, optional, defaults toself.image_std
) — Image standard deviation to use for normalization. Only has an effect ifdo_normalize
is set toTrue
. - do_convert_rgb (
bool
, optional, defaults toself.do_convert_rgb
) — Whether to convert the image to RGB. - do_pad (
bool
, optional, defaults toTrue
) — Whether to pad the image. IfTrue
, will pad the patch dimension of the images in the batch to the largest number of patches in the batch. Padding will be applied to the bottom and right with zeros. - return_tensors (
str
orTensorType
, optional) — The type of tensors to return. Can be one of:- Unset: Return a list of
np.ndarray
. TensorType.TENSORFLOW
or'tf'
: Return a batch of typetf.Tensor
.TensorType.PYTORCH
or'pt'
: Return a batch of typetorch.Tensor
.TensorType.NUMPY
or'np'
: Return a batch of typenp.ndarray
.TensorType.JAX
or'jax'
: Return a batch of typejax.numpy.ndarray
.
- Unset: Return a list of
- data_format (
ChannelDimension
orstr
, optional, defaults toChannelDimension.FIRST
) — The channel dimension format for the output image. Can be one of:"channels_first"
orChannelDimension.FIRST
: image in (num_channels, height, width) format."channels_last"
orChannelDimension.LAST
: image in (height, width, num_channels) format.- Unset: Use the channel dimension format of the input image.
- input_data_format (
ChannelDimension
orstr
, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:"channels_first"
orChannelDimension.FIRST
: image in (num_channels, height, width) format."channels_last"
orChannelDimension.LAST
: image in (height, width, num_channels) format."none"
orChannelDimension.NONE
: image in (height, width) format.
Emu3VQVAE
class transformers.Emu3VQVAE
< source >( config: Emu3VQVAEConfig )
Parameters
- config (Emu3VQVAEConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The VQ-VAE model used in Emu3 for encoding/decoding images into discrete tokens. This model follows the βMake-a-scene: Scene-based text-to-image generation with human priorsβ paper from Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Define the computation performed at every call.
Should be overridden by all subclasses.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module
instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
Emu3TextModel
class transformers.Emu3TextModel
< source >( config: Emu3Config )
Parameters
- config (Emu3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- config — Emu3TextConfig
The bare Emu3Text Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a Emu3TextDecoderLayer
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] )
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.Tensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If
past_key_values
is used, optionally only the lastinput_ids
have to be input (seepast_key_values
).If you want to change padding behavior, you should read
modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - past_key_values (
Cache
ortuple(tuple(torch.FloatTensor))
, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_values
returned by the model at a previous stage of decoding, whenuse_cache=True
orconfig.use_cache=True
.Two formats are allowed:
- a Cache instance, see our kv cache guide;
- Tuple of
tuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no
past_key_values
are passed, the legacy cache format will be returned.If
past_key_values
are used, the user can optionally input only the lastinput_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of allinput_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - cache_position (
torch.LongTensor
of shape(sequence_length)
, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids
, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
The Emu3TextModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Emu3ForCausalLM
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 **kwargs: typing_extensions.Unpack[transformers.models.emu3.modeling_emu3.KwargsForCausalLM] ) β transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.Tensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If
past_key_values
is used, optionally only the lastinput_ids
have to be input (seepast_key_values
).If you want to change padding behavior, you should read
modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - past_key_values (
Cache
, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_values
returned by the model at a previous stage of decoding, whenuse_cache=True
orconfig.use_cache=True
.Has to be an instance of Cache instance, see our kv cache guide.
The model will output the same cache type that is fed as input. If no
past_key_values
are passed, the legacy cache format will be returned.If
past_key_values
are used, the user can optionally input only the lastinput_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of allinput_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - cache_position (
torch.LongTensor
of shape(sequence_length)
, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids
, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length. - Args —
labels (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional): Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]
or -100 (seeinput_ids
docstring). Tokens with indices set to-100
are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
. num_logits_to_keep (int
, optional): Calculate logits for the lastnum_logits_to_keep
tokens. If0
, calculate logits for allinput_ids
(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (Emu3TextConfig) and inputs.
-
loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) β Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) β Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
)Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) β Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) β Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Emu3ForCausalLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import Emu3Processor, Emu3ForConditionalGeneration
>>> import torch
>>> import requests
>>> from PIL import Image
>>> model = Emu3ForCausalLM.from_pretrained("Emu3-community/Emu3-Chat-hf", torch_dtype=torch.bfloat16)
>>> processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
>>> inputs = processor(text=["Can you write me a poem about winter."], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Emu3ForConditionalGeneration
forward
< source >( input_ids: LongTensor = None pixel_values: FloatTensor = None image_sizes: Tensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None labels: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 ) β transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.Tensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If
past_key_values
is used, optionally only the lastinput_ids
have to be input (seepast_key_values
).If you want to change padding behavior, you should read
modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - past_key_values (
Cache
ortuple(tuple(torch.FloatTensor))
, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_values
returned by the model at a previous stage of decoding, whenuse_cache=True
orconfig.use_cache=True
.Two formats are allowed:
- a Cache instance, see our kv cache guide;
- Tuple of
tuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no
past_key_values
are passed, the legacy cache format will be returned.If
past_key_values
are used, the user can optionally input only the lastinput_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of allinput_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - cache_position (
torch.LongTensor
of shape(sequence_length)
, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids
, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length. - Args —
labels (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional): Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]
or -100 (seeinput_ids
docstring). Tokens with indices set to-100
are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
. num_logits_to_keep (int
, optional): Calculate logits for the lastnum_logits_to_keep
tokens. If0
, calculate logits for allinput_ids
(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (Emu3Config) and inputs.
-
loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) β Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) β Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
)Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) β Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) β Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Emu3ForConditionalGeneration forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import Emu3Processor, Emu3ForConditionalGeneration
>>> import torch
>>> import requests
>>> from PIL import Image
>>> model = Emu3ForConditionalGeneration.from_pretrained("Emu3-community/Emu3-Chat-hf", torch_dtype=torch.bfloat16)
>>> processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
>>> conversation = [
... {
... "role": "system",
... "content": [
... {"type": "text", "text": "You are a helpful assistant."},
... ],
... },
... {
... "role": "user",
... "content": [
... {"type": "image"},
... {"type": "text", "text": "Please describe the image."},
... ],
... },
... ]
>>> prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
>>> image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
>>> inputs = processor(images=[image], text=[prompt], return_tensors="pt").to(model.device, torch.bfloat16)
>>> generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)[0]