Semantic similarity between two texts
Hello, I have written a small PHP script that calls the API for the model https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2 to evaluate the percentage of semantic similarity between two French texts. However, the scores I get are often around 50%, which seems too low, especially when the texts express the same idea in different words. I would typically expect scores between 75% and 100% in that case. Do you think it would be better to use a different model, or are there adjustments that could be made to this model to improve the results?
Hello!
This model was trained specifically on English texts, so I think you will get better performance with a multilingual model or one trained specifically for French, such as:
- https://huggingface.co/dangvantuan/sentence-camembert-base
- https://huggingface.co/intfloat/multilingual-e5-small
- https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2
- https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
(Note: I used https://huggingface.co/spaces/mteb/leaderboard?task=sts&language=french and filtered out the larger models.)
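To compare models, it can help to compute the score the same way for each one. Below is a minimal Python sketch, assuming the percentage comes from the cosine similarity of the two sentence embeddings (the original PHP script is not shown, so this is a guess about what it does). The names `cosine_similarity`, `similarity_percent`, and the `embed` callable are hypothetical and only for illustration.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_percent(
    text_a: str,
    text_b: str,
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Embed both texts with the given (hypothetical) callable and
    express their cosine similarity as a percentage.

    Cosine similarity lies in [-1, 1], so the result is not clamped
    and can in principle be negative.
    """
    return 100.0 * cosine_similarity(embed(text_a), embed(text_b))
```

With the sentence-transformers library installed, the `encode` method of `SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")` could be passed as `embed`. Keep in mind that absolute scores differ from model to model, so it is safer to compare how a model ranks pairs of texts than to expect a fixed 75% threshold to mean the same thing everywhere.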
- Tom Aarsen