---
license: creativeml-openrail-m
language:
- en
metrics:
- bleu
---
<h3 align='center' style='font-size: 24px;'>Blazing Fast Tiny Vision Language Model</h3>
<p align='center' style='font-size: 16px;'>A custom 3B-parameter model built by <a href="https://www.linkedin.com/in/manishkumarthota/">@Manish</a>. The model is released for research purposes only; commercial use is not allowed.</p>
## How to use
**Install dependencies**
```bash
pip install transformers # latest version is ok, but we recommend v4.31.0
pip install -q pillow accelerate einops
```
Use the following code for model inference. The text instruction format is similar to that of [LLaVA](https://github.com/haotian-liu/LLaVA).
```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

torch.set_default_device("cuda")

# Create the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "ManishThota/CustomModel",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ManishThota/CustomModel", trust_remote_code=True)

# Function to generate the answer
def predict(question, image_path):
    # Set inputs
    text = f"USER: <image>\n{question}? ASSISTANT:"
    image = Image.open(image_path)

    input_ids = tokenizer(text, return_tensors='pt').input_ids.to('cuda')
    image_tensor = model.image_preprocess(image)

    # Generate the answer
    output_ids = model.generate(
        input_ids,
        max_new_tokens=25,
        images=image_tensor,
        use_cache=True)[0]

    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
```
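
Because `predict` appends a `?` to the question itself, a question that already ends in `?` produces a doubled question mark in the prompt. The sketch below shows the prompt string the template yields and one way to normalize the question first. `build_prompt` is a hypothetical helper written for illustration, not part of the model's API; it only mirrors the f-string used in `predict`.

```python
def build_prompt(question: str) -> str:
    """Return the LLaVA-style prompt string that predict() builds.

    Hypothetical helper for illustration; it mirrors the template
    f"USER: <image>\n{question}? ASSISTANT:" from the snippet above.
    """
    # The template adds its own "?", so drop any trailing "?" and whitespace.
    cleaned = question.strip().rstrip("?").strip()
    return f"USER: <image>\n{cleaned}? ASSISTANT:"


if __name__ == "__main__":
    # Both spellings of the question yield the same prompt.
    print(repr(build_prompt("What is in the image?")))
    print(repr(build_prompt("What is in the image")))
```

With the model loaded, the same normalization can be applied at the call site, for example `predict("What is in the image", "path/to/image.jpg")`, where the image path is a placeholder.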