Microsoft: Improving Text Embeddings with Large Language Models
- uses an LLM instead of complex pipelines to create the training data
- directly generates data for numerous text embedding tasks
- fine-tunes standard models with a contrastive loss, achieving strong performance (see the sketch below)
- critical thought: isn't this kind of benchmark hacking? If the benchmarks were so encompassing that they captured the complete idea of embedding, it might be fine, but I find they often oversimplify.
Feel free to share your thoughts, even if they, like mine, don't beat the benchmarks ;P
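For context, here is a minimal sketch of the kind of in-batch-negatives contrastive objective such embedding fine-tuning typically uses. This is not the paper's actual training code; the temperature value and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive loss with in-batch negatives.

    query_emb, passage_emb: (batch, dim) embeddings of queries and their
    matching passages; every other passage in the batch acts as a negative.
    The temperature here is an assumed value, not the paper's setting.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```

The model producing the embeddings is then simply trained to maximize similarity between each query and its (LLM-generated) positive passage relative to the rest of the batch.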
Currently attempting to hack EvoDiff to generate binders for target proteins, with some interesting results. The generated binders tend to change conformation, sometimes drastically, when bound to the target proteins compared to their unbound states. Below is the target protein with an IDR linker, the generated binder, and the binder bound to the target protein with the IDR linker, with structures predicted by ESMFold. Notice how the binder goes from a solid alpha-helix to beta-sheets (in orange). That's quite a change in secondary structure!
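For anyone who wants to reproduce the folding step, here is a minimal sketch using the fair-esm package's ESMFold model. The sequences are hypothetical placeholders, and concatenating target, IDR linker, and binder into one chain is my assumption about how the bound-state prediction was set up, not a confirmed detail of the experiment.

```python
import torch
import esm  # pip install fair-esm (ESMFold needs its extra folding dependencies)

# Hypothetical placeholder sequences -- substitute the real target protein,
# IDR linker, and EvoDiff-generated binder sequences here.
TARGET = "MKTAYIAKQRQISFVK"    # placeholder, not the real target
LINKER = "GGSGGSGGS"           # placeholder IDR-style linker
BINDER = "AEAAAKEAAAKEAAAKA"   # placeholder, not the generated binder

model = esm.pretrained.esmfold_v1().eval()
if torch.cuda.is_available():
    model = model.cuda()

def fold_to_pdb(sequence: str, out_path: str) -> None:
    """Predict a structure with ESMFold and write it to a PDB file."""
    with torch.no_grad():
        pdb_str = model.infer_pdb(sequence)
    with open(out_path, "w") as f:
        f.write(pdb_str)

# Fold the binder alone, then the binder fused to the target via the linker,
# to compare unbound vs. bound conformations.
fold_to_pdb(BINDER, "binder_unbound.pdb")
fold_to_pdb(TARGET + LINKER + BINDER, "binder_bound.pdb")
```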