ModernBERT Embed


ModernBERT Embed is an embedding model trained from ModernBERT-base, bringing the new advances of ModernBERT to embeddings!

Trained on the Nomic Embed weakly-supervised and supervised datasets, modernbert-embed also supports Matryoshka Representation Learning, allowing embeddings to be truncated to 256 dimensions for a 3x memory reduction with minimal performance loss.

Performance

Scores on the MTEB benchmark; parenthesized numbers are the task counts per category.

| Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
|---|---|---|---|---|---|---|---|---|---|
| nomic-embed-text-v1 | 768 | 62.40 | 74.10 | 43.90 | 85.20 | 55.70 | 52.80 | 82.10 | 30.10 |
| nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.40 |
| modernbert-embed-base | 768 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
| nomic-embed-text-v1.5 | 256 | 61.04 | 72.10 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34 | 30.05 |
| modernbert-embed-base | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |

Usage

You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from main:

pip install git+https://github.com/huggingface/transformers.git

Reminder: this model is trained similarly to Nomic Embed and REQUIRES prefixes to be added to the input. For more information, see the instructions in Nomic Embed.

For most use cases, adding search_query: to the query and search_document: to the documents will be sufficient.
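
For example, a tiny helper like the following (hypothetical, not part of the model's API) keeps the prefixes consistent across your pipeline:

# Hypothetical helper (not part of any library) to keep prefixes consistent.
def add_prefix(texts, prefix):
    # Nomic Embed also defines "clustering" and "classification" prefixes
    # for those task types; see the Nomic Embed instructions.
    return [f"{prefix}: {text}" for text in texts]

queries = add_prefix(["What is TSNE?"], "search_query")
documents = add_prefix(["TSNE is a dimensionality reduction algorithm"], "search_document")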

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 768) (1, 768)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7214],
#         [0.3260]])
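
model.similarity computes cosine similarity by default. For retrieval, you can rank documents per query with Sentence Transformers' built-in semantic_search utility; a minimal sketch, where the second document is invented for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

query_embeddings = model.encode(["search_query: What is TSNE?"])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
    "search_document: Paris is the capital of France.",  # invented distractor document
])

# For each query, returns the top_k documents ranked by cosine similarity,
# as dicts with "corpus_id" and "score"
hits = util.semantic_search(query_embeddings, doc_embeddings, top_k=2)
print(hits[0])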
Sentence Transformers usage with Matryoshka Truncation

In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the truncate_dim parameter when loading the SentenceTransformer model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=256)

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 256) (1, 256)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7759],
#         [0.3419]])

Note the small differences compared to the full 768-dimensional similarities.

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Mean-pool the token embeddings, ignoring padding tokens via the attention mask
    token_embeddings = model_output[0]  # last hidden state: (batch, seq_len, hidden_dim)
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9  # avoid division by zero for empty sequences
    )


queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 768]) torch.Size([1, 768])

similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.7214],
#         [0.3260]])
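
Since both matrices are L2-normalized, the matrix product above is exactly cosine similarity. Continuing from the snippet above, torch's built-in cosine_similarity should give the same numbers:

# Broadcasting: (2, 1, 768) x (1, 1, 768) -> (2, 1), matching `similarities`
cosine = F.cosine_similarity(
    query_embeddings.unsqueeze(1), doc_embeddings.unsqueeze(0), dim=-1
)
print(cosine)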
Transformers usage with Matryoshka Truncation

In transformers, you can truncate embeddings to a smaller dimension by slicing the mean-pooled embeddings prior to normalization.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Mean-pool the token embeddings, ignoring padding tokens via the attention mask
    token_embeddings = model_output[0]  # last hidden state: (batch, seq_len, hidden_dim)
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9  # avoid division by zero for empty sequences
    )


queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed-base")
truncate_dim = 256

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = doc_embeddings[:, :truncate_dim]
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 256]) torch.Size([1, 256])

similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.7759],
#         [0.3419]])

Note the small differences compared to the full 768-dimensional similarities.

Transformers.js

import { pipeline } from '@xenova/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'nomic-ai/modernbert-embed-base', {
    quantized: false, // Comment out this line to use the quantized version
});

// Compute sentence embeddings
const texts = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?'];
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings);
Transformers.js usage with different quantizations

import { pipeline } from '@xenova/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'nomic-ai/modernbert-embed-base', {
    dtype: 'q4f16',
});

// Compute sentence embeddings
const texts = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?'];
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings);

Training

Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!

[Nomic Atlas map of the contrastive pretraining data]

We train our embedder using a multi-stage training pipeline. Starting from a long-context BERT model, the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.

In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
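
Both stages optimize a contrastive objective over paired texts. As a rough illustration (not the exact training code; the loss, temperature, and batch construction details are in the technical report), an InfoNCE-style loss with in-batch negatives looks like this:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    # Rows are paired: doc_emb[i] is the positive for query_emb[i];
    # every other document in the batch serves as an in-batch negative.
    # The temperature value here is illustrative only.
    query_emb = F.normalize(query_emb, dim=1)
    doc_emb = F.normalize(doc_emb, dim=1)
    logits = query_emb @ doc_emb.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)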

For more details, see the Nomic Embed Technical Report and corresponding blog post.

The data used to train the models is released in its entirety. For more details, see the contrastors repository.

Join the Nomic Community

Citation

If you find the model, dataset, or training code useful, please cite our work:

@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder}, 
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}