Add similarity measure used for each model evaluation

#44
by sjrhuschlee - opened

When these models are trained, a similarity measure is typically chosen (usually cosine or dot product), and the same measure should be used at inference time. If the wrong one is used (e.g. cosine instead of dot product), the performance of the embedding model can degrade significantly. So I think it would be extremely helpful to add a column with the recommended similarity measure, or the one that produced the evaluation results. This would be similar to how SBERT lists suitable distance metrics for each of their models.
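To illustrate why the choice matters, here is a minimal sketch with made-up 2-D vectors (not real embeddings) in which dot product and cosine similarity pick different top results for the same query. The vector values and document names are purely hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical, unnormalized "embeddings" chosen only to show the effect.
query = [1.0, 0.0]
docs = {
    "short_aligned": [1.0, 0.0],  # same direction as the query, small norm
    "long_off_axis": [3.0, 3.0],  # 45 degrees off the query, large norm
}

# Rank the documents under each measure and keep the top hit.
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))

print("dot product picks:", best_by_dot)
print("cosine picks:     ", best_by_cosine)
```

Dot product rewards the larger norm, while cosine only looks at direction, so a model trained with one can rank differently when evaluated with the other.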


Massive Text Embedding Benchmark org

Great point. Unfortunately, I don't know of any way to automatically extract that from the folder of a submitted model, so we would have to annotate every model manually.

Yeah, that is a fair point. I think some of this could potentially be automated if the model folder contains the files needed for Sentence Transformers compatibility, since one of their configs can specify whether the embedding vector should be normalized. And if the vector is normalized, then we know the appropriate similarity should be cosine. For example, for the intfloat/e5-base-v2 model, the file https://huggingface.co/intfloat/e5-base-v2/blob/main/modules.json contains the following:

    {
      "idx": 2,
      "name": "2",
      "path": "2_Normalize",
      "type": "sentence_transformers.models.Normalize"
    }

which tells us that cosine similarity is the most likely preferred similarity measure.
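A rough sketch of what such a check could look like, assuming modules.json is a JSON list of module entries shaped like the one above. The helper name `has_normalize_module` is hypothetical, and the first two entries in the sample are illustrative rather than copied from any specific model:

```python
import json

# Module type string taken from the modules.json entry quoted above.
NORMALIZE_TYPE = "sentence_transformers.models.Normalize"

def has_normalize_module(modules_json_text):
    """Return True if any module entry in modules.json is a Normalize layer.

    Hypothetical helper: assumes the file is a JSON list of dicts,
    each with a "type" key.
    """
    modules = json.loads(modules_json_text)
    return any(m.get("type") == NORMALIZE_TYPE for m in modules)

# Illustrative file contents; only the last entry comes from the thread.
sample = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
"""

print(has_normalize_module(sample))
```

A `True` result would only be a hint that cosine (or, equivalently, dot product on the normalized output) is intended. A `False` result says nothing either way, since the model may normalize elsewhere or not use Sentence Transformers at all.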

I'm not entirely familiar with how submissions are made to MTEB, but I wonder if it is also possible to ask new submissions for this information when being added to the leaderboard?

Hey @Muennighoff, pinging to ask whether you think checking for the modules.json file could get us a partial solution?

Massive Text Embedding Benchmark org

Yeah, so that would only work for SentenceTransformers, and I'm not sure this holds in 100% of cases: "if the vector is normalized then we know the appropriate similarity should be cosine".

I'm curious why you would want this information? Does it affect your model choice? I think MTEB is meant to help you choose the best model - then you will need to look at that model itself anyways to figure out how to load it etc.

That’s fair. I guess if the vector is normalized, we know that dot product and cosine give the same similarity, and those are typically the two measures I’ve seen for these types of models. But that does not mean other measures like Euclidean distance would not also be valid.
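As a quick sanity check of that point, here is a small sketch with arbitrary example vectors. For unit-length vectors, dot product equals cosine, and squared Euclidean distance is 2 - 2·cos, so all three produce the same ranking:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Arbitrary example vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

print("dot:              ", dot(a, b))
print("cosine:           ", cosine(a, b))
print("squared euclidean:", squared_euclidean(a, b))
print("2 - 2 * cosine:   ", 2 - 2 * cosine(a, b))
```

Since the squared distance is a decreasing function of the cosine, sorting by smallest Euclidean distance gives the same order as sorting by largest cosine, but only when the vectors are normalized.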

Right, I agree. I think MTEB is really great for choosing a model, but I often find that model cards don’t provide sufficient information on how best to use the model when calculating similarities, even though the authors used this information when running the MTEB benchmark. So I’m often not 100% certain that I’ll get the same performance as reported on the leaderboard, because I may have loaded the model incorrectly. As a result, I often experiment with different similarities locally on a benchmark to double-check that I’m getting the expected performance. It would be nice if we could skip that step.

Massive Text Embedding Benchmark org

Hmm I see. Currently submission works by just fetching model cards.

But we could switch to something like this for submission: https://huggingface.co/spaces/gaia-benchmark/leaderboard & then we could ask for the similarity metric; maybe even an example code snippet to reproduce it.

That would be great! And I like the idea about the code snippet, that would really help speed up the correct usage of these models.
