nanoLLaVA-1.5 - Improved sub 1B Vision-Language Model

Description

nanoLLaVA-1.5 is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. It is an update to the v1.0 release, qnguyen3/nanoLLaVA.

| Model | VQA v2 | TextVQA | ScienceQA | POPE | MMMU (Test) | MMMU (Eval) | GQA | MM-VET |
|---|---|---|---|---|---|---|---|---|
| nanoLLaVA-1.0 | 70.84 | 46.71 | 58.97 | 84.1 | 28.6 | 30.4 | 54.79 | 23.9 |
| nanoLLaVA-1.5 | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD |

Training Data

The training data will be released later, as I am still writing a paper on it. Expect the final version to be much more powerful than the current one.

Finetuning Code

Coming Soon!!!

Usage

You can use the model with transformers via the following script:

pip install -U transformers accelerate flash_attn

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'qnguyen3/nanoLLaVA-1.5'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image, sample images can be found in images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
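The step in the script that splits the prompt on `<image>` and inserts `-200` can be isolated into a small helper. The sketch below is illustrative only: `splice_image_token`, `IMAGE_TOKEN_INDEX`, and `toy_encode` are hypothetical names not defined by the model or by transformers, and the toy encoder stands in for the real tokenizer so the example runs without downloading anything.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200  # sentinel id used in the script above (name is hypothetical)


def splice_image_token(text: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Tokenize the text on both sides of one <image> placeholder
    and put the sentinel id between the two token lists."""
    chunks = text.split(IMAGE_PLACEHOLDER)
    if len(chunks) != 2:
        raise ValueError("expected exactly one <image> placeholder in the prompt")
    before, after = (encode(chunk) for chunk in chunks)
    return before + [IMAGE_TOKEN_INDEX] + after


def toy_encode(s: str) -> List[int]:
    """Stand-in encoder for demonstration: one id per character."""
    return [ord(c) for c in s]


ids = splice_image_token("a<image>b", toy_encode)
print(ids)
```

With the real tokenizer, the same helper could be called as `splice_image_token(text, lambda s: tokenizer(s).input_ids)`. Rejecting prompts that do not contain exactly one placeholder is a design choice of this sketch; the original script simply indexes the first two chunks.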

Prompt Format

The model follows the ChatML standard, but without a \n after <|im_end|>:

<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
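For readers who want to see how the format above is assembled, here is a minimal sketch that builds the same string by hand. `build_prompt` is a hypothetical helper, not part of the model's code, and the trailing newline after `assistant` is an assumption based on standard ChatML; in practice `tokenizer.apply_chat_template` from the usage script should be treated as the source of truth.

```python
from typing import Dict, List


def build_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Join ChatML turns with no newline after <|im_end|> (hypothetical helper)."""
    prompt = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )
    if add_generation_prompt:
        # Assumption: the assistant header ends with a newline, as in standard ChatML.
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "Answer the question"},
    {"role": "user", "content": "<image>\nWhat is the picture about?"},
]
print(build_prompt(messages))
```

Comparing this output with the `print(text)` line in the usage script is a quick way to check that a hand-built prompt matches the tokenizer's template.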

The model is trained using a modified version of Bunny.

Model size: 1.05B params (Safetensors, tensor type BF16)