---
pipeline_tag: image-text-to-text
tags:
- florence2
- smollm
- custom_code
license: apache-2.0
---

## FloSmolV

A vision model for **image-text-to-text** generation, built by combining [HuggingFaceTB/SmolLM-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct) and [microsoft/Florence-2-base](https://huggingface.co/microsoft/Florence-2-base).

The **Florence-2-base** model generates captions from input images very quickly, and that caption text can then be fed to a language model to answer questions. **SmolLM-360M-Instruct** is a small, fast instruction-tuned model from the Hugging Face team that answers text queries with low latency. Chained together, the two form a Visual Question Answering model that can answer questions about images; a rough sketch of this two-stage flow is shown below.
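
To make the two-stage design concrete, here is a minimal sketch of the same idea using the two base models directly: Florence-2 produces a caption, then SmolLM answers a question given that caption. This is only an illustration of the concept, not the repository's implementation; the FloSmolV checkpoint in the Usage section wires both stages together for you, and the caption/question message format and generation settings below are assumptions.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: caption the image with Florence-2 (uses its "<CAPTION>" task prompt)
florence = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text="<CAPTION>", images=image, return_tensors="pt").to(device)
caption_ids = florence.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0]

# Stage 2: answer the question from the caption with SmolLM
# (the "Image description / Question" prompt format below is an assumption)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-360M-Instruct")
smollm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-360M-Instruct").to(device)

question = "what is the object in the image?"
messages = [{"role": "user", "content": f"Image description: {caption}\n\nQuestion: {question}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors="pt").to(device)
answer_ids = smollm.generate(**prompt_ids, max_new_tokens=64)
print(tokenizer.decode(answer_ids[0][prompt_ids["input_ids"].shape[-1]:], skip_special_tokens=True))
```

In practice you do not need to run these steps yourself; the packaged model below performs both in a single call.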

## Usage

Make sure to install the necessary dependencies.

```bash
pip install -qU transformers accelerate einops bitsandbytes flash_attn timm
```
|
```python |
|
# load a free image from pixabay |
|
from PIL import Image |
|
import requests |
|
url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg" |
|
img = Image.open(requests.get(url, stream=True).raw) |
|
|
|
# download model |
|
from transformers import AutoModelForCausalLM |
|
model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda() |
|
model(img, "what is the object in the image?") |
|
``` |

You can find more about the model and its configuration script here: https://huggingface.co/dmedhi/flosmolv/tree/main