Massive Text Embedding Benchmark

non-profit

https://github.com/embeddings-benchmark

embeddings-benchmark

AI & ML interests

Massive Text Embedding Benchmark

Recent Activity

Muennighoff updated a Space about 4 hours ago

mteb/arena

Muennighoff updated a dataset about 9 hours ago

mteb/arena-results

orionweller updated a dataset about 9 hours ago

mteb/results

orionweller

updated a Space 1 day ago

MTEB Leaderboard

tomaarsen

posted an update 4 days ago

That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.
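
The Matryoshka point in the post (two valid output dimensionalities, 768 and 256) can be illustrated with a minimal sketch. It assumes the usual Matryoshka convention: keep the leading components of the full embedding and re-normalize to unit length. The vectors below are synthetic placeholders, not real model outputs, and the helper names `truncate_embedding` and `dot` are hypothetical, not part of any library.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and L2-normalize the result.

    Illustrative helper (hypothetical name): this is the common way a
    Matryoshka-trained embedding is shortened, not code from the model repo.
    """
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-ins for two 768-dimensional embeddings (assumed values).
full_a = [math.sin(0.01 * i) for i in range(768)]
full_b = [math.sin(0.01 * i + 0.3) for i in range(768)]

# Compare the pair at both dimensionalities named in the post.
sim_768 = dot(truncate_embedding(full_a, 768), truncate_embedding(full_b, 768))
sim_256 = dot(truncate_embedding(full_a, 256), truncate_embedding(full_b, 256))
print(f"cosine at 768 dims: {sim_768:.4f}")
print(f"cosine at 256 dims: {sim_256:.4f}")
```

In practice the truncation is normally delegated to the embedding library; Sentence Transformers, for example, accepts a `truncate_dim` argument when constructing a `SentenceTransformer`. The sketch only shows what that step amounts to.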

mmhamdy

authored a paper 4 days ago

Bridging the Data Provenance Gap Across Text, Speech and Video

Paper • 2412.17847 • Published 16 days ago • 7

orionweller

authored 7 papers 15 days ago

NevIR: Negation in Neural Information Retrieval

Paper • 2305.07614 • Published May 12, 2023

Learning from Task Descriptions

Paper • 2011.08115 • Published Nov 16, 2020

MegaWika: Millions of reports and their sources across 50 diverse languages

Paper • 2307.07049 • Published Jul 13, 2023

Defending Against Poisoning Attacks in Open-Domain Question Answering

Paper • 2212.10002 • Published Dec 20, 2022

Learning to Reason via Program Generation, Emulation, and Search

Paper • 2405.16337 • Published May 25, 2024

CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

Paper • 2406.17186 • Published Jun 24, 2024 • 1

Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Paper • 2409.11136 • Published Sep 17, 2024 • 21

tomaarsen, orionweller

authored a paper 15 days ago

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 17 days ago • 116

swj0419

authored a paper 28 days ago

Negative Token Merging: Image-based Adversarial Feature Guidance

Paper • 2412.01339 • Published Dec 2, 2024 • 22

rbroc

authored a paper 30 days ago

Automated speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting

Paper • 2301.06916 • Published Jan 13, 2023

mmhamdy

authored a paper 30 days ago

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Paper • 2412.02980 • Published Dec 4, 2024 • 12

rbroc

authored 2 papers 30 days ago

Large language models surpass human experts in predicting neuroscience results

Paper • 2403.03230 • Published Mar 4, 2024 • 4

$S^3$ -- Semantic Signal Separation

Paper • 2406.09556 • Published Jun 13, 2024