---
pipeline_tag: image-text-to-text
tags:
- florence2
- smollm
- custom_code
license: apache-2.0
---
## FloSmolV
A vision-language model for **image-text-to-text** generation, built by combining [HuggingFaceTB/SmolLM-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct) and [microsoft/Florence-2-base](https://huggingface.co/microsoft/Florence-2-base).
**Florence-2-base** generates text captions from input images quickly, and those captions serve as context for a language model to answer questions. **SmolLM-360M-Instruct** is a compact, fast instruction-tuned model from the Hugging Face team. Chaining the two produces a
visual question answering model that can answer questions about images.
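The two-stage flow described above can be sketched as follows. This is a conceptual illustration only: the stub functions below stand in for the real Florence-2 and SmolLM models, and their names and outputs are illustrative, not part of the actual model code.

```python
# Conceptual sketch of FloSmolV's two-stage pipeline.
# The stubs below stand in for the real models; their outputs are hardcoded
# placeholders purely to show how the stages compose.

def florence2_caption(image) -> str:
    # Stage 1: the vision model (Florence-2-base) turns the image
    # into a text caption.
    return "a red cable car ascending a mountain"  # placeholder caption

def smollm_answer(caption: str, question: str) -> str:
    # Stage 2: the language model (SmolLM-360M-Instruct) answers the
    # question, using the caption as its only visual context.
    return f"The caption says: {caption}. The object is a cable car."  # placeholder

def vqa(image, question: str) -> str:
    # Chain the stages: image -> caption -> answer.
    caption = florence2_caption(image)
    return smollm_answer(caption, question)
```

Because the language model only ever sees the caption, the quality of the final answer is bounded by how much detail Florence-2 captures in stage 1.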
## Usage
Make sure to install the necessary dependencies.
```bash
pip install -qU transformers accelerate einops bitsandbytes flash_attn timm
```
```python
# Load a sample image (a free photo from Pixabay)
from PIL import Image
import requests

url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# Download the model (custom code, so trust_remote_code=True is required;
# .cuda() assumes a CUDA-capable GPU is available)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda()

# Ask a question about the image
model(img, "what is the object in the image?")
```
You can find more details about the model and its configuration script here: https://huggingface.co/dmedhi/flosmolv/tree/main