---
pipeline_tag: image-text-to-text
tags:
- florence2
- smollm
- custom_code
license: apache-2.0
---

## FloSmolV

A vision model for **image-text to text** generation, built by combining [HuggingFaceTB/SmolLM-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct) and [microsoft/Florence-2-base](https://huggingface.co/microsoft/Florence-2-base).

**Florence-2-base** quickly generates text (captions) from input images. These captions can then serve as context for a language model to answer questions. **SmolLM-360M-Instruct** is an excellent model from the Hugging Face team that produces fast text output for input queries. Combining the two yields a visual question answering model that can answer questions about images.

## Usage

Make sure to install the necessary dependencies.

```bash
pip install -qU transformers accelerate einops bitsandbytes flash_attn timm
```

```python
# Load a free image from Pixabay
from PIL import Image
import requests

url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# Download the model (requires a CUDA-capable GPU)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda()
model(img, "what is the object in the image?")
```

You can find more about the model and its configuration script here: https://huggingface.co/dmedhi/flosmolv/tree/main
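The caption-then-answer flow the combined model implements can be sketched with the two base models used separately, via the standard `transformers` API. This is a rough illustration, not the model's actual internals: the `<CAPTION>` task prompt follows the Florence-2 model card, while `build_prompt` and its `Context:`/`Question:` format are hypothetical helpers invented here for clarity.

```python
def build_prompt(caption: str, question: str) -> str:
    # Hypothetical prompt format: the Florence-2 caption becomes
    # context for the language model's answer.
    return f"Context: {caption}\nQuestion: {question}\nAnswer:"

def answer_from_image(img, question: str) -> str:
    # Heavy imports kept inside the function so the module imports cheaply.
    from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

    # Stage 1: Florence-2 captions the image.
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-base", trust_remote_code=True
    )
    florence = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base", trust_remote_code=True
    )
    inputs = processor(text="<CAPTION>", images=img, return_tensors="pt")
    out = florence.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=128,
    )
    caption = processor.batch_decode(out, skip_special_tokens=True)[0]

    # Stage 2: SmolLM answers the question from the caption text.
    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-360M-Instruct")
    llm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-360M-Instruct")
    ids = tokenizer(build_prompt(caption, question), return_tensors="pt")
    answer_ids = llm.generate(**ids, max_new_tokens=64)
    return tokenizer.decode(answer_ids[0], skip_special_tokens=True)
```

The released `dmedhi/flosmolv` checkpoint wraps both stages behind a single `model(img, question)` call via `custom_code`, so the sketch above is only useful for understanding the pipeline or experimenting with the two components independently.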