teowu committed
Commit 57f0058 · verified · 1 Parent(s): 543607c

Update README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -7,6 +7,8 @@ pipeline_tag: image-text-to-text
  tags:
  - multimodal
  - aria
+ base_model:
+ - rhymes-ai/Aria-Base-8K
  ---
  <!-- <p align="center">
  <br>Aria</br>
@@ -25,7 +27,7 @@ This checkpoint is one of base models of [Aria](https://huggingface.co/rhymes-ai
 
  <img src="./aria-stages.png" alt="Aria Training Stages" style="width: 100%;">
 
- Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/teowu/Aria-Base-8K).
+ Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/rhymes-ai/Aria-Base-8K).
 
  <!--
  - Aria is the **first open multimodal native MoE** model, capable of seamlessly handling various input modalities within a MoE architecture.
@@ -33,12 +35,12 @@ Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/teowu/Ari
  - Compared to similar or even larger models, Aria boasts **faster speeds** and **lower costs**. This high efficiency stems from its ability to activate only 3.9B parameters during inference – the **fewest** among models with comparable performance.
  -->
 
- ## Aria-Base-8K
+ ## Aria-Base-64K
 
  - **Base Model After Long-Context Pre-training**: This model corresponds to the model checkpoint after the long-context pre-training stage, with 33B tokens (21B multimodal, 12B language, 69% in long-form) trained in this stage. This stage lasts 1,000 iterations, with all sequences packed to 65536 with Megatron-LM, with global batch size 512. During this training stage, the learning rate keeps constant at `3.5e-5`.
  - **Appropriate for Video and Long-document Fine-tuning**: This model is recommended for long-form continue pre-training or fine-tuning, e.g. on video QA datasets or long-document QA datasets. While resource is limited, it is also possible to post-train this model with short instruction tuning datasets and transfer to long-form QA scenarios.
  - **Understanding on Hundreds of Images**: This model is capable of understanding up to 250 high-resolution images or up to 500 mid-resolution images.
- - **Strong Base Performance on Language and Multimodal Scenarios**: This model retains strong base performance as [Aria-Base-8K](https://huggingface.co/teowu/Aria-Base-8K).
+ - **Strong Base Performance on Language and Multimodal Scenarios**: This model retains strong base performance as [Aria-Base-8K](https://huggingface.co/rhymes-ai/Aria-Base-8K).
  - ***Limited Chat Template Availability***: This model is trained with a very low percentage of data (around 3%) re-formatted with the chat template. Hence, it might not be optimal to be directly used with chat templates.
 
  <!-- # Model Info
@@ -68,7 +70,7 @@ import torch
  from PIL import Image
  from transformers import AutoModelForCausalLM, AutoProcessor
 
- model_id_or_path = "teowu/Aria-Base-64K"
+ model_id_or_path = "rhymes-ai/Aria-Base-64K"
 
  model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
 
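
Reviewer note: the "Base Model After Long-Context Pre-training" bullet, which this commit leaves unchanged, quotes 33B tokens, 1,000 iterations, a packed sequence length of 65536, and a global batch size of 512. The short sketch below checks that these figures are mutually consistent. It assumes the global batch size counts packed sequences per iteration and that every sequence is packed to the full length; neither assumption is stated in the README, and the variable names are illustrative only.

```python
# Sanity check of the token budget quoted in the README bullet.
# Assumptions (not confirmed by the README): the global batch size is
# measured in packed sequences, and each sequence is filled to 65536 tokens.
iterations = 1_000
global_batch_size = 512   # packed sequences per iteration (assumed unit)
packed_seq_len = 65_536   # tokens per packed sequence

total_tokens = iterations * global_batch_size * packed_seq_len
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.2f}B)")
```

Under those assumptions the product is about 33.55B tokens, in line with the roughly 33B the README reports for this stage.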