OWL-ViT batch image inference

Dear Hugging Face users,

I’m trying to implement batched image inference with OWL-ViT. At the moment, I’m working on a set of 11 images, with 72 labels and batch_size=2. I learned how to implement batching from here:

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb#scrollTo=-Wc92cWK-Aas

with the only difference being that I’m using the “google/owlvit-large-patch14” model instead of “google/owlvit-base-patch32”. The code works well for the first two images, but on the third I get:

RuntimeError: shape '[4, 37, 768]' is invalid for input of size 115200

here:

with torch.no_grad():
    outputs = model(**inputs)

I don’t understand what these shapes are. Do they refer to the image being processed or to the underlying network? Maybe I made a mistake somewhere? Am I using too many labels? Thanks.

cc @adirik

A bit late to answer this, but it might be because you’re not batching your text queries. Also keep in mind that you may have hit the maximum number of text tokens you can pass to OWL-ViT, which uses the CLIP tokenizer.
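To illustrate the text-batching point: the processor expects one list of text queries per image in the batch, so when you feed it two images you need two copies of your label list. Here is a minimal sketch of that batching logic using stand-in images and labels (the variable names and the commented-out processor call are illustrative, not from the original post):

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

# Hypothetical inputs mirroring the question: 11 images, 72 labels.
images = [f"image_{i}" for i in range(11)]   # stand-ins for PIL images
labels = [f"label_{i}" for i in range(72)]

for image_batch in chunked(images, 2):
    # One copy of the label list per image in the batch, so the text
    # and image batch dimensions agree. With the real processor this
    # would look like:
    #   inputs = processor(text=[labels] * len(image_batch),
    #                      images=image_batch, return_tensors="pt")
    text_queries = [labels] * len(image_batch)
    assert len(text_queries) == len(image_batch)
```

Note that with 11 images and batch_size=2 the last batch contains a single image, so hard-coding the text queries for a batch of two would also break on the final iteration.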