meta-llama/Llama-3.2-11B-Vision-Instruct did not reply

Dears,

I tried to use meta-llama/Llama-3.2-11B-Vision-Instruct model on my local Laptop, using the same example as stated in the hugging face Model card. but it took more than 10hours without giving me the needed reply on the provided image. please any help to test the example on local laptop with reasonable time response.
below the code that I am using. also I got the needed access to the HF meta/llama model using
huggingface-cli login and paste the needed token and successfully took the access.

pip install --upgrade transformers

from transformers import AutoProcessor, AutoModelForPreTraining
processor = AutoProcessor.from_pretrained(“meta-llama/Llama-3.2-11B-Vision-Instruct”)
model = AutoModelForPreTraining.from_pretrained(“meta-llama/Llama-3.2-11B-Vision-Instruct”)

url = “https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg”
image = Image.open(requests.get(url, stream=True).raw)
image # this shows the image and it was the successful image

#no error for previous code and no delay
#below code took tooooooooooo much time with any error and without any reply- it seems it is in #hangging state

messages = [
{“role”: “user”, “content”: [
{“type”: “image”},
{“type”: “text”, “text”: "If I had to write a haiku for this one, it would be: "}
]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
image,
input_text,
add_special_tokens=False,
return_tensors=“pt”
).to(model.device)

output = model.generate(**inputs, max_new_tokens=330)
print(processor.decode(output[0]))

1 Like

That’s normal unless you have a gaming laptop. Even a gaming laptop is a bit tough because the 11B model is very large… you want about 30GB of VRAM, not RAM, so it’s a wonder it didn’t crash…
I’m guessing that the laptop would not have had enough RAM for a normal laptop, so the laptop would have calculated the SSD capacity as virtual RAM… so I would expect it took longer than in a normal environment.

can you share please the HW specs - RAM, VRAM GPU - CPU -SSD for a server that will be used to host meta-llama/Llama-3.2-11B-Vision-Instruct and used in my RAG application that has excellent response time…I need good customer experience.
Thanks for your support…

1 Like

Try it out first. So we can guess the required specs.
If you think CPU space or Zero GPU space is enough, that’s fine. If you want to spend more money and have more freedom, consider a paid plan from HF, Google or other services.
As for Llama3.2 11B, it should somehow work with Colab’s Free with 4-bit quantization. However, it will take some getting used to.

Zero GPU Space (Various restrictions, but can host up to 10 spaces at the same time for $10/month / 40GB VRAM, 80GB RAM)

Free CPU Space (No VRAM, 16GB RAM)

Google Colab Free (16GB VRAM, ?GB RAM)

Thanks John
can you please elaborate more…I did not get the point…I need local HW sepecs…

1 Like

I thought you wanted to do it online…
OK. If you want to use a model up to about 12B with quantization, just buy a GeForce with 12GB or 16GB of VRAM and you’re done. The larger the VRAM, the more you can apply. If you don’t quantize, you will need more than 30GB of VRAM, but if you buy such a GPU individually, you are talking about the price of buying a car…


this is my laptop SPECs == it is good one to run llama3.2-11B-vision-instruct

1 Like

Dear John,
I may go with cloud-based runtime environment- which is cheaper and better, Huggingface, Google cloud…or others.
I saw HF Spaces that you shared with me…
How to create and run MYSIBAWY space on HG …that represent my experience on meta/llama3.2-11b-vision-instruct …
is SPACES free – or we have to Pay for it

1 Like

Your laptop is more powerful than I thought, but it is totally lacking in both RAM and VRAM for a generative AI. That’s normal; even a GeForce 3060 is not enough for an 11B model. Possible laptops will seldom be possible.

As for the price of cloud services, Colab Free is free, HF’s Zero GPU is $10/month for up to 10 simultaneous spaces. There are also Colab paid and Kaggle.
As to which one I would recommend, it depends on how you use it. If you are using it as a chatbot, HF’s Zero GPU is probably the easiest with 40GB of VRAM. However, it cannot be used frequently because of its quota. If you want to train models, Zero GPU is not suitable at all because of quota.
I heard that if you want to train LLMs on HF, it is common to use a paid training function? I don’t know, I’ve never used it.
On the other hand, Colab Free has 16 GB of VRAM, so it can infer the 11B model in the quantized state, but training in the quantized state is a bit difficult, although it is possible, apparently.
I heard that Kaggle and Colab Pro are powerful and can do everything, but for a fee. And I’m not at all familiar with cloud services in general…

Edit:
The HF space is paid for by the people who make it. It’s free to use. If you want to use someone else’s space, you can use their space for free.

Thanks John …
why they name it ZERO GPU

1 Like

I’m actually new to HF, too, about 6 months or so, so I don’t know what was going on when the Zero GPU service was created.
I think it was just because it was kind of cool.:roll_eyes: