Does this model support mixed-modality input?
#29 opened by Haon-Chen
For example, text+image on both query and document side.
Hey @Haon-Chen, thanks for reaching out! Our model suffers from the modality gap to some extent, like most CLIP-like models. This means that a text query will always favor a text document over an image document. I would suggest sticking with cross-modal search (text-to-image and image-to-text) or uni-modal search (text-to-text and image-to-image), and then mixing the results using heuristics.
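One common heuristic for mixing separate result lists is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming you have already run a text-to-text and a text-to-image search with the model (the doc IDs and function name below are hypothetical, not part of this model's API):

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking via
    Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-modality results for one text query:
text_to_text = ["doc_a", "doc_b", "doc_c"]    # text query vs. text documents
text_to_image = ["img_x", "doc_a", "img_y"]   # text query vs. image documents

fused = rrf_fuse([text_to_text, text_to_image])
print(fused[0])  # "doc_a" ranks first: it appears near the top of both lists
```

Because RRF only uses ranks, not raw similarity scores, it sidesteps the problem that cross-modal and uni-modal similarities live on different scales due to the modality gap.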