Peter Szemraj

pszemraj

AI & ML interests

metallic intuition

Recent Activity

updated a dataset 1 day ago
BEE-spoke-data/govdocs1-by-extension
updated a dataset 1 day ago
BEE-spoke-data/cosmopedia-v2-mincols
updated a dataset 1 day ago
pszemraj/cnn_dailymail-cleaned

Organizations

Stacked Summaries, Analytics Club at ETH Zürich, ml4pubmed, Gradio-Blocks-Party, OneFrame, postbot, Ontocord's M*DEL, BEEspoke Data, Sablo AI, memecaps, Hugging Face Discord Community

pszemraj's activity

reacted to tomaarsen's post with 🔥 3 days ago
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base
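The Matryoshka property mentioned above means the full 768-dimensional embedding can be truncated to its first 256 components and re-normalized, trading a little accuracy for smaller storage. A minimal NumPy sketch of that idea (the random vector here is just a stand-in for a real model output, not an actual embedding):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka-trained embedding
    and re-normalize to unit length so cosine similarity still works."""
    truncated = emb[:dim]
    return truncated / np.linalg.norm(truncated)

# Toy 768-d vector standing in for a real embedding from the model.
rng = np.random.default_rng(0)
full = rng.normal(size=768)
small = truncate_embedding(full, 256)

print(small.shape)                                  # (256,)
print(round(float(np.linalg.norm(small)), 6))       # 1.0
```

In practice, Sentence Transformers can handle this for you: recent versions accept a `truncate_dim` argument when loading a model, so you wouldn't normally truncate by hand.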

Very nice work by Zach Nussbaum and colleagues at Nomic AI.