Commit 2cb53cb by teowu · verified · 1 Parent(s): e9c5390

Create README.md

Files changed (1)
  1. README.md +128 -0
README.md ADDED
@@ -0,0 +1,128 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - aria
+ ---
+ <!-- <p align="center">
+ <br>Aria</br>
+ </p> -->
+
+
+ # Aria-Base-8K Model Card
+
+ This model is part of the Aria-Base model series, designed for research studies and fine-tuning.
+
+ <!--
+ - Aria is the **first open multimodal native MoE** model, capable of seamlessly handling various input modalities within a MoE architecture.
+ - Aria performs **on par with GPT-4o mini and Gemini 1.5 Flash** across a range of multimodal tasks while maintaining strong performance on **text**-only tasks.
+ - Compared to similar or even larger models, Aria boasts **faster speeds** and **lower costs**. This high efficiency stems from its ability to activate only 3.9B parameters during inference – the **fewest** among models with comparable performance.
+ -->
+
+ ## Aria-Base-8K
+
+ - **Pretrain Base Model**: This checkpoint is taken after the multimodal pre-training stage, which trains on 1.4T tokens (1T language + 400B multimodal). The stage runs for 43,000 iterations with a global batch size of 4096, with all sequences packed to 8192 tokens using Megatron-LM. During this stage, the learning rate decays from `8.75e-5` to `3.5e-5`.
+ - **Appropriate for Continued Pre-training**: This model is released for continued pre-training, *e.g.* on domain-specific pre-training data (OCR, long-context, agent). In Aria, this checkpoint is further pre-trained on 64K long-context multimodal data, yielding [Aria-Base-64K](https://huggingface.co/teowu/Aria-Base-64K).
+ - **Strong Base Performance on Language and Multimodal Scenarios**: This model shows excellent base performance on knowledge-related evaluations in both pure-language and multimodal scenarios (MMLU 70+, MMMU 50+, *etc.*).
+ - ***Limited Ability on Long-context Scenarios***: This model is trained only with an 8K context length and is not expected to perform at its best on inputs longer than 8K tokens (e.g. a video with >100 frames). [Aria-Base-64K](https://huggingface.co/teowu/Aria-Base-64K) is more appropriate for longer-sequence understanding.
+ - ***Limited Chat Template Availability***: Only a very small share of the training data (around 3%) is re-formatted with the chat template. Hence, testing this model directly on various benchmarks may not give optimal results.
+
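As a quick sanity check on the training figures in the first bullet, the sketch below multiplies the quoted iteration count, global batch size, and packed sequence length. It assumes the global batch size counts sequences and that every packed sequence is filled to the full 8192 tokens; neither detail is stated explicitly in the card.

```python
# Figures quoted in the model card.
iterations = 43_000
global_batch_size = 4096   # assumed to count sequences per iteration
sequence_length = 8192     # tokens per packed sequence (assumed fully filled)

tokens_per_iteration = global_batch_size * sequence_length
total_tokens = iterations * tokens_per_iteration

print(f"tokens per iteration: {tokens_per_iteration:,}")
print(f"total tokens: {total_tokens / 1e12:.2f}T")
```

Under these assumptions the product comes to roughly 1.44T tokens, which is consistent with the stated 1.4T (1T language + 400B multimodal).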
+ <p align="center">
+ 🔗 <a href="https://rhymes.ai/" target="_blank"> Try Aria!</a> · 📖 <a href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model" target="_blank">Blog</a> · 📌 <a href="https://arxiv.org/pdf/2410.05993" target="_blank">Paper</a>
+ · ⭐ <a href="https://github.com/rhymes-ai/Aria" target="_blank">GitHub</a> · 🟣 <a href="https://discord.com/invite/u8HxU23myj" target="_blank"> Discord </a>
+ </p>
+
+
+ <!-- # Model Info
+
+ | Model | Download | Parameter | Context Length |
+ | :---- | :------- | :------------ | :------ |
+ | Aria | < HF link - TBD> | • Activation: 3.9B (3.5B MoE + 0.4B Visual Encoder) <br> • Total: 25.3B | 64K | -->
+
+ ## Benchmark
+
+ N/A.
+
+ ## Quick Start
+ ### Installation
+ ```
+ pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
+ pip install flash-attn --no-build-isolation
+
+ # For better inference performance, you can install grouped-gemm, which may take 3-5 minutes to install
+ pip install grouped_gemm==0.1.6
+ ```
+
+ ### Inference
+
+ You can load this checkpoint the same way as the final Aria model. However, as a base model, it may not yield optimal chat performance.
+
+ ```python
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ model_id_or_path = "teowu/Aria-Base-8K"
+
+ # Load the checkpoint in bfloat16; trust_remote_code is needed for the custom model code.
+ model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
+
+ processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
+
+ image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
+
+ image = Image.open(requests.get(image_path, stream=True).raw)
+
+ # A single user turn containing one image followed by a text question.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"text": None, "type": "image"},
+             {"text": "what is the image?", "type": "text"},
+         ],
+     }
+ ]
+
+ # Render the chat template, then tokenize the text and preprocess the image.
+ text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt")
+ # Match the image tensor dtype to the model and move all inputs to the model device.
+ inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
+     output = model.generate(
+         **inputs,
+         max_new_tokens=500,
+         stop_strings=["<|im_end|>"],
+         tokenizer=processor.tokenizer,
+         do_sample=True,
+         temperature=0.9,
+     )
+     # Keep only the newly generated tokens by dropping the prompt prefix.
+     output_ids = output[0][inputs["input_ids"].shape[1]:]
+     result = processor.decode(output_ids, skip_special_tokens=True)
+
+ print(result)
+ ```
+
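The slicing step near the end of the example above keeps only the continuation, since `generate` returns the prompt tokens followed by the new ones for decoder-only models. A minimal sketch of that step with plain Python lists, using made-up token IDs purely for illustration:

```python
# Hypothetical token IDs; real IDs come from the processor and the model.
prompt_ids = [101, 7, 42, 9]             # tokens passed to the model
generated = prompt_ids + [55, 56, 57]    # returned sequence: prompt + continuation

# Drop the prompt prefix so only the newly generated tokens are decoded.
prompt_length = len(prompt_ids)
new_token_ids = generated[prompt_length:]

print(new_token_ids)
```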
+ ### Advanced Inference and Fine-tuning
+
+ We provide a [codebase](https://github.com/rhymes-ai/Aria) for more advanced usage of Aria,
+ including vLLM inference, cookbooks, and fine-tuning on custom datasets.
+
+ Because this checkpoint shares the same structure as the final model,
+ you can simply replace `rhymes-ai/Aria` with this model's path for any advanced inference or fine-tuning.
+
+
+ ## Citation
+ If you find our work helpful, please consider citing.
+ ```
+ @article{aria,
+   title={Aria: An Open Multimodal Native Mixture-of-Experts Model},
+   author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
+   year={2024},
+   journal={arXiv preprint arXiv:2410.05993},
+ }
+ ```