Does this model support mixed-modality input?
#29 opened by Haon-Chen
For example, text+image on both query and document side.
Hey @Haon-Chen, thanks for reaching out! Our model suffers from the modality gap to some extent, like most CLIP-like models. This means that a text query will always favor a text document over an image document. I would suggest sticking with cross-modal search (text-to-image and image-to-text) or uni-modal search (text-to-text and image-to-image), and then mixing the results using heuristics.
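One common heuristic for mixing separate result lists is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming you have already run a text-to-text and a text-to-image search with the model (the doc IDs and function name below are hypothetical, not part of this model's API):

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking via
    Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-modality results for one text query:
text_to_text = ["doc_a", "doc_b", "doc_c"]    # text query vs. text documents
text_to_image = ["img_x", "doc_a", "img_y"]   # text query vs. image documents

fused = rrf_fuse([text_to_text, text_to_image])
print(fused[0])  # "doc_a" ranks first: it appears near the top of both lists
```

Because RRF only uses ranks, not raw similarity scores, it sidesteps the problem that cross-modal and uni-modal similarities live on different scales due to the modality gap.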