Multilingual support?

#1
by MLSDev - opened

Hi, I am looking for a RAG embedding model.

I see that jina-embedding can be a good option,

and this distilled version should be more affordable to use. (I suppose I need this one for embedding the chunks and the query version for embedding the user's question.)

But is this distilled version still multilingual?

Is the distilled version's accuracy worse than the "original" version's?

It is still multilingual and indeed somewhat worse than the original version, but it is still very much usable (and 1000 times faster).

Unfortunately, you cannot pair the query and passage distilled versions. I'm not entirely sure why, but it doesn't work. :(

thanks for the answer @CISCai

Just one more question: for RAG, where I need to embed documents and retrieve them from a question, should I use this model, the query version, or the general distilled one?

From your previous answer, I suppose the general distilled one.

Well, depending on your dataset you may want to do things differently.

If there is a high correlation between the queries and passages, you might get away with using the same model for both. In that case, test a few and see which gives you the best result.
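For what it's worth, "test a few" could look roughly like the sketch below: take a handful of questions for which you already know the right passage, and measure how often each candidate model ranks that passage in the top k. Everything here is an assumption for illustration: `recall_at_k` and `toy_embed` are made-up names, and `toy_embed` is only a hashed bag-of-words stand-in, not any of the distilled models. Swap in whatever call your real models expose.

```python
import math
import zlib
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_at_k(
    embed_query: Embedder,
    embed_passage: Embedder,
    passages: Sequence[str],
    labeled_queries: Sequence[Tuple[str, int]],
    k: int = 1,
) -> float:
    """Fraction of queries whose known-correct passage is in the top k."""
    passage_vecs = [embed_passage(p) for p in passages]
    hits = 0
    for query, gold_index in labeled_queries:
        qv = embed_query(query)
        ranked = sorted(
            range(len(passages)),
            key=lambda i: cosine(qv, passage_vecs[i]),
            reverse=True,
        )
        if gold_index in ranked[:k]:
            hits += 1
    return hits / len(labeled_queries)


def toy_embed(text: str, dim: int = 64) -> Vector:
    """ASSUMPTION: stand-in embedder (hashed bag of words), not a real model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


passages = ["the cat sat on the mat", "stock prices fell sharply today"]
labeled = [("where did the cat sit", 0), ("what happened to stock prices", 1)]

# Each candidate is a (query embedder, passage embedder) pair; using the same
# function twice corresponds to "the same model for both".
candidates: Dict[str, Tuple[Embedder, Embedder]] = {
    "toy-symmetric": (toy_embed, toy_embed),
}
for name, (eq, ep) in candidates.items():
    print(name, recall_at_k(eq, ep, passages, labeled, k=1))
```

A few dozen labeled questions from your own data are usually more telling than a public benchmark here, since the whole point is how well queries and passages correlate in your dataset.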

It is, however, more likely that the correlation is low, in which case you might want to generate synthetic queries from your passages and embed those with the query distilled version. This adds significant overhead to your embedding process, but it can usually be done with a fairly low-cost model, and you will still be able to do a super fast query similarity search.
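A minimal sketch of that synthetic-query setup, under stated assumptions: `build_synthetic_index`, `search`, `fake_generate_queries`, and `toy_embed` are hypothetical names. In practice `generate_queries` would call a low-cost generative model and `embed_query` would call the query distilled model; both are replaced by trivial stand-ins here so the example runs on its own.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_synthetic_index(
    passages: Sequence[str],
    generate_queries: Callable[[str], List[str]],
    embed_query: Callable[[str], Vector],
) -> List[Tuple[Vector, int]]:
    """Embed synthetic queries and remember which passage each came from."""
    index: List[Tuple[Vector, int]] = []
    for passage_id, passage in enumerate(passages):
        for synthetic_query in generate_queries(passage):
            index.append((embed_query(synthetic_query), passage_id))
    return index


def search(
    index: Sequence[Tuple[Vector, int]],
    question: str,
    embed_query: Callable[[str], Vector],
    top_k: int = 3,
) -> List[int]:
    """Query-to-query similarity search; returns distinct passage ids."""
    qv = embed_query(question)
    ranked = sorted(index, key=lambda entry: cosine(qv, entry[0]), reverse=True)
    seen, result = set(), []
    for _, passage_id in ranked:
        if passage_id not in seen:
            seen.add(passage_id)
            result.append(passage_id)
        if len(result) == top_k:
            break
    return result


def toy_embed(text: str, dim: int = 64) -> Vector:
    """ASSUMPTION: stand-in for the query distilled model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def fake_generate_queries(passage: str) -> List[str]:
    """ASSUMPTION: stand-in for a low-cost generative model writing questions."""
    return ["what is said about " + passage]


passages = ["the cat sat on the mat", "stock prices fell sharply today"]
index = build_synthetic_index(passages, fake_generate_queries, toy_embed)
print(search(index, "what is said about stock prices", toy_embed, top_k=1))
```

The passages themselves are never embedded in this layout. The user's question is compared only against synthetic queries, so a single query-side model handles both sides, and the expensive generation step happens once at indexing time.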
