Does the Llama-3.2 Vision model support multiple images?
Does this model support multiple images? If so, would the call look like this?
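import requests
from PIL import Image

# Image.open() takes a file path or file object, not a URL string.
# Assuming url1 and url2 are http(s) URLs, fetch the bytes first.
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "please describe these two images"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)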
Thanks for the question! We recommend using one image per inference call; the model doesn't work reliably with multiple images.
Ok~ Thanks for your reply!
Hey Sanyam,
Thanks for the response.
Any idea why this is happening?
Is it a limitation of the model size or the lack of training?
What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.
I am getting a CUDA out-of-memory error when I use multiple images.
I have the same question. Can this model run inference on video files, for example by using cv2 to generate a set of frames?
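For anyone who wants to try the frame-based approach, here is a minimal sketch of sampling evenly spaced frames with cv2. The function names `sample_frame_indices` and `extract_frames` and the default of 8 frames are assumptions made for illustration, not part of any Llama or transformers API. It only produces the frames; it says nothing about how well the model handles them.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to num_samples frame indices spread evenly across the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


def extract_frames(video_path: str, num_samples: int = 8):
    """Read the sampled frames from video_path as RGB PIL images."""
    import cv2  # imported lazily so the index helper works without OpenCV
    from PIL import Image

    capture = cv2.VideoCapture(video_path)
    try:
        total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for index in sample_frame_indices(total, num_samples):
            capture.set(cv2.CAP_PROP_POS_FRAMES, index)
            ok, frame = capture.read()
            if not ok:
                continue  # skip frames that fail to decode
            # OpenCV returns BGR arrays; PIL expects RGB.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        return frames
    finally:
        capture.release()
```

Given the single-image recommendation earlier in this thread, each returned frame would most likely have to be sent to the model in its own call.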
I have the same question. I am trying to run inference on video files by extracting frames and transcripts so the model can reason about the video as a whole. That requires accumulating understanding across frames rather than single-frame inference, and Llama 3.2 Vision seems unable to do this.
Same here, I would also like to have multi image support in 1 conversation. What is the ETA on this? Will it be supported in the future?
And what about images across history?
messages = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "please describe the image"}
        ]
    },
    {
        "role": "assistant", "content": "It shows a cat fighting with a dog"
    },
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you explain more? Here's another perspective"}
        ]
    },
]
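Separately from whether the model answers well, a history like the one above needs one image per {"type": "image"} entry when it reaches the processor. The sketch below counts those entries so the image list can be checked before the call. `count_image_placeholders` and `check_images_match` are hypothetical helper names, and the assumption that images are paired with placeholders in order of appearance is mine, not something confirmed in this thread.

```python
def count_image_placeholders(messages) -> int:
    """Count {"type": "image"} entries across every turn of a chat."""
    count = 0
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-text turns (e.g. assistant replies) hold no images
        count += sum(1 for part in content if part.get("type") == "image")
    return count


def check_images_match(messages, images) -> None:
    """Raise ValueError if the image list and the placeholders disagree in length."""
    expected = count_image_placeholders(messages)
    if expected != len(images):
        raise ValueError(
            f"messages contain {expected} image placeholder(s) "
            f"but {len(images)} image(s) were provided"
        )
```

For the two-turn history above, the count is 2, so two images would be passed, first-turn image first (under the ordering assumption stated here).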
Did we get an answer to this?
I have a set of images and a set of context passages returned by my retriever engine. I now need to pass both to my generation model (any vision model) to get the final response.
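One possible way to stay within the one-image recommendation given earlier in this thread is to build a separate single-image prompt per retrieved image, each carrying the same retrieved text. This is only a sketch under that assumption: `build_single_image_messages` is a hypothetical helper, and the "Context:" / "Question:" layout is an arbitrary choice, not a format the model requires.

```python
from typing import List


def build_single_image_messages(question: str, context_chunks: List[str]) -> list:
    """Build a one-turn chat with one image placeholder plus the retrieved text."""
    context = "\n\n".join(context_chunks)
    text = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},  # exactly one image per call
                {"type": "text", "text": text},
            ],
        }
    ]
```

Each retrieved image would then get its own processor and generate call with these messages, and the per-image answers would be combined afterwards, for example in a final text-only prompt.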