ryanzhangfan and RaushanTurganbay (HF staff) committed

Commit bd6ad57 · verified · 1 Parent(s): 443d34d

Update README.md (#1)

- Update README.md (f5b654a51eec488ab786cf6f1e7c968949871340)

Co-authored-by: Raushan Turganbay <[email protected]>

Files changed (1): README.md (+143 -3)
README.md CHANGED

@@ -1,3 +1,143 @@

The previous README contained only the three-line YAML front matter below; the commit replaces it with the full model card that follows.

- ---
- license: apache-2.0
- ---

---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
inference: false
tags:
- vision
- image-text-to-text
arxiv: 2409.18869
---

<div align='center'>
<h1>Emu3: Next-Token Prediction is All You Need</h1>
<h3></h3>

[Emu3 Team, BAAI](https://www.baai.ac.cn/english.html)

</div>

<div align='left'>
<img src="https://github.com/baaivision/Emu3/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="80%" width="70%" />
</div>

Below is the model card of the Emu3-Chat model, which is adapted from the original Emu3 model card that you can find [here](https://huggingface.co/BAAI/Emu3-Gen).

## Model details

**Model type:**
Emu3 is an open-source multimodal model trained with a next-token prediction task. By tokenizing images and text into a discrete space, Emu3 is trained as a single transformer from scratch on a mixture of multimodal sequences.
It is an auto-regressive language model based on the transformer architecture.

**Paper or resources for more information:**
https://github.com/baaivision/Emu3

## Highlights

- **Emu3** is capable of generating high-quality images following text input, simply by predicting the next vision token. The model naturally supports flexible resolutions and styles.
- **Emu3** shows strong vision-language understanding capabilities: it can perceive the physical world and provide coherent text responses. Notably, this capability is achieved without relying on CLIP or a pretrained LLM.
- **Emu3** generates videos causally, simply by predicting the next token in a video sequence, unlike video diffusion models such as Sora. With a video in context, Emu3 can also naturally extend the video and predict what will happen next.
- **Emu3** outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6, and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.

## How to use the model

First, make sure you have `transformers >= 4.48.0` installed.
Also make sure to follow the correct prompt template (`USER: xxxASSISTANT:`) and to add the token `<image>` at the location where you want to query the image:
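
For reference, here is a minimal sketch (not part of the original card) of building that prompt by hand and passing it straight to the processor; it assumes the processor expands the `<image>` placeholder as described above. The chat-template examples below are the recommended path and build this prompt for you.

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("BAAI/Emu3-Chat-hf")

# Manually formatted prompt following the `USER: xxxASSISTANT:` template,
# with <image> marking where the image is queried.
prompt = "USER: <image>What are these?ASSISTANT:"
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt")
```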

### Using `pipeline`:

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="BAAI/Emu3-Chat-hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
            {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
        ],
    },
]

out = pipe(text=messages, max_new_tokens=20)
print(out)
>>> [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
```
74
+
75
+ ### Using pure `transformers`:
76
+
77
+ Below is an example script to run generation in `float16` precision on a GPU device:
78
+
79
+ ```python
80
+ import requests
81
+ from PIL import Image
82
+
83
+ import torch
84
+ from transformers import AutoProcessor, Emu3ForConditionalGeneration
85
+
86
+ model_id = "BAAI/Emu3-Chat-hf"
87
+ model = Emu3ForConditionalGeneration.from_pretrained(
88
+ model_id,
89
+ torch_dtype=torch.float16,
90
+ low_cpu_mem_usage=True,
91
+ device_map="cuda:0",
92
+ )
93
+
94
+ processor = AutoProcessor.from_pretrained(model_id)
95
+
96
+ # Define a chat history and use `apply_chat_template` to get correctly formatted prompt
97
+ # Each value in "content" has to be a list of dicts with types ("text", "image")
98
+ conversation = [
99
+ {
100
+
101
+ "role": "user",
102
+ "content": [
103
+ {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
104
+ {"type": "text", "text": "What are these?"},
105
+ ],
106
+ },
107
+ ]
108
+ inputs_dict = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=True, return_dict=True)
109
+ inputs_dict = inputs_dict.to(0, torch.float16)
110
+
111
+ output = model.generate(**inputs_dict, max_new_tokens=50, do_sample=False)
112
+ print(processor.decode(output[0][2:], skip_special_tokens=True))
113
+ ```
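
Note that `model.generate` returns the prompt tokens followed by the newly generated tokens, so the decode above also prints the prompt text. If you only want the model's answer, one common pattern (a sketch, not part of the original card) is to slice off the prompt length before decoding:

```python
# Keep only the tokens generated after the prompt before decoding.
# Assumes `inputs_dict` and `output` come from the snippet above.
generated_ids = output[:, inputs_dict["input_ids"].shape[1]:]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```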

### Model optimization

#### Use Flash-Attention 2 to further speed up generation

First, make sure to install `flash-attn`. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then simply change the snippet above as follows:

```diff
model = Emu3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   attn_implementation="flash_attention_2",
    device_map="cuda:0",
)
```
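
Memory usage can also be reduced with weight quantization. The snippet below is a sketch (not part of the original card) using `bitsandbytes` 4-bit loading; it assumes `bitsandbytes` is installed and a CUDA GPU is available:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Emu3ForConditionalGeneration

model_id = "BAAI/Emu3-Chat-hf"

# 4-bit NF4 quantization with float16 compute, via bitsandbytes.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Emu3ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```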

## Citation

```
@misc{wang2024emu3nexttokenpredictionneed,
      title={Emu3: Next-Token Prediction is All You Need},
      author={Xinlong Wang and Xiaosong Zhang and Zhengxiong Luo and Quan Sun and Yufeng Cui and Jinsheng Wang and Fan Zhang and Yueze Wang and Zhen Li and Qiying Yu and Yingli Zhao and Yulong Ao and Xuebin Min and Tao Li and Boya Wu and Bo Zhao and Bowen Zhang and Liangdong Wang and Guang Liu and Zheqi He and Xi Yang and Jingjing Liu and Yonghua Lin and Tiejun Huang and Zhongyuan Wang},
      year={2024},
      eprint={2409.18869},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.18869},
}
```