KnutJaegersberg 
posted an update Jan 2, 2024
Microsoft: Improving Text Embeddings with Large Language Models

- uses an LLM instead of complex pipelines to create the training data
- directly generates data for numerous text embedding tasks
- fine-tunes standard models with contrastive loss, achieving great performance
- critical thought: isn't this kinda benchmark hacking? If the benchmarks are so encompassing that they capture the complete idea of embedding, it's maybe a good idea, but often it is oversimplifying, I find.
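
For anyone unfamiliar with the contrastive loss mentioned in the list above, here is a minimal sketch of one common variant, InfoNCE with in-batch negatives. This is only an illustration: the post does not say which exact variant or hyperparameters the paper uses, and the function names, the toy vectors, and the temperature value of 0.05 are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    documents[i] is treated as the positive for queries[i]; every other
    document in the batch acts as a negative. The temperature value is an
    illustrative assumption, not taken from the paper.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the positive document.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (hypothetical): each query is close to its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

# Correctly paired embeddings should give a lower loss than mismatched ones.
print(info_nce_loss(queries, aligned) < info_nce_loss(queries, swapped))
```

In a real fine-tuning run the embeddings would come from the model being trained and the loss would be backpropagated through it; the plain-Python version here only shows what quantity is being minimized.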

Feel free to share your thoughts, even if they like mine don't beat the benchmarks ;P


https://arxiv.org/abs/2401.00368

Linking the HF paper page as well: http://huggingface.co/papers/2401.00368

The fact they used only synthetic data is huge IMO - makes this almost an unsupervised training setup

I guess we'll see more and more techniques like that based on foundational LLMs!

To add more context: the authors said they will release more material in the upcoming revision of the paper, which is nice, because it should let anyone run a faithful reproduction of their synthetic data generation process. See the authors' reply at https://huggingface.co/papers/2401.00368#65978d195f689f3f0b2caeb9.

Also worth mentioning that @andersonbcdefg ran both stages:

(I'm unsure whether the reproduction of the second stage is faithful to the original, so I asked them at https://twitter.com/alvarobartt/status/1742839431881490717. Either way, I think we may need to wait for the authors to share the full details of the prompting strategies used for the generation.)