---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
---
# nanoLLaVA - Sub 1B Vision-Language Model

**IMPORTANT**: **nanoLLaVA-1.5** is out with a much better performance. Please find it [here](https://huggingface.co/qnguyen3/nanoLLaVA-1.5).

<p align="center">
  <img src="https://i.postimg.cc/d15k3YNG/nanollava.webp" alt="Logo" width="350">
</p>

## Description
nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.
- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model   | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA**  | **MM-VET** |
|---------|--------|---------|-----------|------|-------------|-------------|------|--------|
| Score   | 70.84  | 46.71   | 58.97     | 84.1 | 28.6        | 30.4        | 54.79| 23.9   |

## Training Data
The training data will be released later, as I am still writing a paper on it. Expect the final model to be much more powerful than the current one.

## Finetuning Code
Coming Soon!!!

## Usage
You can use the model with `transformers` via the following script:

```bash
pip install -U transformers accelerate flash_attn
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

# split the prompt at the <image> placeholder and tokenize each half
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
# splice in -200, the LLaVA-style sentinel id marking where image features are injected
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image, sample images can be found in images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
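The image-placeholder handling in the script above can be shown in isolation: the prompt is split at `<image>`, each half is tokenized, and a sentinel id (`-200` in the script) is spliced between them to mark where the image features go. A minimal sketch, using a toy stand-in tokenizer (the helper name and toy tokenizer are illustrative, not part of the model's API):

```python
# Sketch of the <image> splicing step. `tokenize` is any function
# mapping a string to a list of token ids; here we use a toy
# word-level stand-in instead of the real tokenizer.
def splice_image_token(prompt, tokenize, image_token_index=-200):
    chunks = [tokenize(chunk) for chunk in prompt.split('<image>')]
    # insert the sentinel id between the two tokenized halves
    return chunks[0] + [image_token_index] + chunks[1]

def toy_tokenize(s):
    # demonstration only: one "token" per whitespace-separated word
    return [hash(w) % 1000 for w in s.split()]

ids = splice_image_token('Describe: <image> in detail', toy_tokenize)
print(ids)
```

At generation time, the model replaces the sentinel position with the projected image embeddings, which is why the real script passes `images=image_tensor` alongside `input_ids`.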

## Prompt Format
The model follows the ChatML standard, but without a `\n` after `<|im_end|>`:
```
<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
```
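If you build the prompt string by hand rather than through `apply_chat_template`, the format above can be reproduced with plain string formatting. A minimal sketch (the `build_prompt` helper is illustrative, not part of the repository):

```python
# Build the nanoLLaVA prompt manually, mirroring the ChatML-style
# format shown above: note there is NO newline after <|im_end|>.
def build_prompt(system: str, question: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>"
        f"<|im_start|>user\n<image>\n{question}<|im_end|>"
        f"<|im_start|>assistant\n"
    )

print(build_prompt("Answer the question", "What is the picture about?"))
```

Using `tokenizer.apply_chat_template(..., add_generation_prompt=True)` as in the usage script should produce the same layout, so manual construction is only needed if you bypass the tokenizer's template.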

---
| Image                                | Example                                                                                     |
|--------------------------------------|---------------------------------------------------------------------------------------------|
| ![small](example_1.png)              | **What is the text saying?** <br> "Small but mighty". <br>**How does the text correlate to the context of the image?** <br> The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar. |
---

The model is trained using a modified version of [Bunny](https://github.com/BAAI-DCAI/Bunny/tree/main/bunny).