Upload README.md
README.md
CHANGED
@@ -9044,173 +9044,12 @@ model-index:
|
|
9044 |
task:
|
9045 |
type: PairClassification
|
9046 |
---
|
9047 |
-
|
9048 |
-
<h4 align="center">
|
9049 |
-
<p>
|
9050 |
-
<a href=#news>News</a> |
|
9051 |
-
<a href=#models>Models</a> |
|
9052 |
-
<a href=#usage>Usage</a> |
|
9053 |
-
<a href="#evaluation">Evaluation</a> |
|
9054 |
-
<a href="#contact">Contact</a> |
|
9055 |
-
<a href="#faq">FAQ</a>
|
9056 |
-
<a href="#license">License</a> |
|
9057 |
-
<a href="#acknowledgement">Acknowledgement</a>
|
9058 |
-
<p>
|
9059 |
-
</h4>
|
9060 |
-
|
9061 |
-
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=d5cb84e7-4b3a-4d82-85a1-19ec3721c447" />
|
9062 |
-
|
9063 |
-
## News
|
9064 |
-
- 12/11/2024: Release of [Technical Report](https://arxiv.org/abs/2412.04506)
|
9065 |
-
- 12/04/2024: Release of [snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) and [snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0) our newest models with multilingual workloads in mind.
|
9066 |
-
|
9067 |
-
|
9068 |
-
## Models
|
9069 |
-
Snowflake arctic-embed-m-v2.0 is the newest addition to the suite of embedding models Snowflake has released optimizing for retrieval performance and inference efficiency.
|
9070 |
-
Arctic Embed 2.0 introduces a new standard for multilingual embedding models, combining high-quality multilingual text retrieval without sacrificing performance in English.
|
9071 |
-
Released under the permissive Apache 2.0 license, Arctic Embed 2.0 is ideal for applications that demand reliable, enterprise-grade multilingual search and retrieval at scale.
|
9072 |
-
|
9073 |
-
Key Features:
|
9074 |
-
|
9075 |
-
1. Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.
|
9076 |
-
|
9077 |
-
2. Inference efficiency: Its 113m non-embedding parameters inference is fast and efficient for any scale.
|
9078 |
-
|
9079 |
-
3. Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes/vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training.
|
9080 |
-
|
9081 |
-
4. Long Context Support: arctic-embed-m-v2.0 builds on [GTE-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) which can support a context window of up to 8192 via the use of RoPE.
|
9082 |
-
|
9083 |
-
|
9084 |
-
### Quality Benchmarks
|
9085 |
-
Unlike most other open-source models, Arctic-embed-m-v2.0 excels across English (via MTEB Retrieval) and multilingual (via MIRACL and CLEF).
|
9086 |
-
You no longer need to support models to empower high-quality English and multilingual retrieval. All numbers mentioned below are the average NDCG@10 across the dataset being discussed.
|
9087 |
-
|
9088 |
-
| Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
|
9089 |
-
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
9090 |
-
| **snowflake-arctic-m-v2.0** | 305M | 113M | 768 | **55.4** | 55.2 | **51.7** | **53.9** |
|
9091 |
-
| snowflake-arctic-m | 109M | 86M | 768 | 54.9 | 24.9 | 34.4 | 29.1 |
|
9092 |
-
| me5 base | 560M | 303M | 1024 | 51.4 | 54.0 | 43.0 | 34.6 |
|
9093 |
-
| bge-m3 (BAAI) | 568M | 303M | 1024 | 48.8 | **56.8** | 40.8 | 41.3 |
|
9094 |
-
| gte (Alibaba) | 305M | 113M | 768 | 51.1 | 52.3 | 47.7 | 53.1 |
|
9095 |
-
|
9096 |
-
Aside from high-quality retrieval, arctic delivers embeddings that are easily compressible. By leveraging vector truncation via MRL to decrease vector size by 3x with about 3% degradation in quality.
|
9097 |
-
Combine MRLed vectors with vector compression (Int4) to power retrieval in 128 bytes per doc.
|
9098 |
-
|
9099 |
-
| Model | | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (5) | Relative Performance | CLEF (Full) | Relative Performance |
|
9100 |
-
|---|---|:---:|:---:|:---:|:---:|:---:|---|---|---|
|
9101 |
-
| snowflake-arctic-m-v2.0 | 768 | 55.4 | N/A | 55.2 | N/A | 51.7 | N/A | 53.9 | N/A |
|
9102 |
-
| snowflake-arctic-m-v2.0 | 256 | 54.4 | -1.81% | 54.0 | -2.17% | 50.6 | -2.13% | 52.3 | -3.06% |
|
9103 |
-
|
9104 |
-
## Usage
|
9105 |
-
|
9106 |
-
### Using Sentence Transformers
|
9107 |
|
9108 |
```python
|
9109 |
from sentence_transformers import SentenceTransformer
|
9110 |
-
|
9111 |
-
# Load the model
|
9112 |
-
model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
|
9113 |
-
model = SentenceTransformer(model_name, trust_remote_code=True)
|
9114 |
-
|
9115 |
-
# Define the queries and documents
|
9116 |
-
queries = ['what is snowflake?', 'Where can I get the best tacos?']
|
9117 |
-
documents = ['The Data Cloud!', 'Mexico City of Course!']
|
9118 |
-
|
9119 |
-
# Compute embeddings: use `prompt_name="query"` to encode queries!
|
9120 |
-
query_embeddings = model.encode(queries, prompt_name="query")
|
9121 |
-
document_embeddings = model.encode(documents)
|
9122 |
-
|
9123 |
-
# Compute cosine similarity scores
|
9124 |
-
scores = model.similarity(query_embeddings, document_embeddings)
|
9125 |
-
|
9126 |
-
# Output the results
|
9127 |
-
for query, query_scores in zip(queries, scores):
|
9128 |
-
doc_score_pairs = list(zip(documents, query_scores))
|
9129 |
-
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
|
9130 |
-
print("Query:", query)
|
9131 |
-
for document, score in doc_score_pairs:
|
9132 |
-
print(score, document)
|
9133 |
-
|
9134 |
-
```
|
9135 |
-
|
9136 |
-
### Using Huggingface Transformers
|
9137 |
-
|
9138 |
-
|
9139 |
-
You can use the transformers package to use Snowflake's arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion and use the query prefix below (just on the query).
|
9140 |
-
|
9141 |
-
```python
|
9142 |
import torch
|
9143 |
-
from transformers import AutoModel, AutoTokenizer
|
9144 |
-
|
9145 |
-
model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
|
9146 |
-
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
9147 |
-
model = AutoModel.from_pretrained(model_name, add_pooling_layer=False, trust_remote_code=True)
|
9148 |
-
model.eval()
|
9149 |
-
|
9150 |
-
query_prefix = 'query: '
|
9151 |
-
queries = ['what is snowflake?', 'Where can I get the best tacos?']
|
9152 |
-
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
|
9153 |
-
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)
|
9154 |
-
|
9155 |
-
documents = ['The Data Cloud!', 'Mexico City of Course!']
|
9156 |
-
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)
|
9157 |
-
|
9158 |
-
# Compute token embeddings
|
9159 |
-
with torch.no_grad():
|
9160 |
-
query_embeddings = model(**query_tokens)[0][:, 0]
|
9161 |
-
document_embeddings = model(**document_tokens)[0][:, 0]
|
9162 |
-
|
9163 |
-
|
9164 |
-
# normalize embeddings
|
9165 |
-
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
|
9166 |
-
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)
|
9167 |
-
|
9168 |
-
scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
|
9169 |
-
for query, query_scores in zip(queries, scores):
|
9170 |
-
doc_score_pairs = list(zip(documents, query_scores))
|
9171 |
-
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
|
9172 |
-
#Output passages & scores
|
9173 |
-
print("Query:", query)
|
9174 |
-
for document, score in doc_score_pairs:
|
9175 |
-
print(score, document)
|
9176 |
-
```
|
9177 |
-
|
9178 |
-
### Using Huggingface Transformers.js
|
9179 |
-
|
9180 |
-
If you haven't already, you can install the [Transformers.js](https://huggingface.co/docs/transformers.js) JavaScript library from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
|
9181 |
-
```bash
|
9182 |
-
npm i @huggingface/transformers
|
9183 |
-
```
|
9184 |
-
|
9185 |
-
You can then use the model for retrieval, as follows:
|
9186 |
-
|
9187 |
-
```js
|
9188 |
-
import { pipeline, dot } from '@huggingface/transformers';
|
9189 |
-
|
9190 |
-
// Create feature extraction pipeline
|
9191 |
-
const extractor = await pipeline('feature-extraction', 'Snowflake/snowflake-arctic-embed-m-v2.0');
|
9192 |
-
|
9193 |
-
// Generate sentence embeddings
|
9194 |
-
const sentences = [
|
9195 |
-
'query: what is snowflake?',
|
9196 |
-
'The Data Cloud!',
|
9197 |
-
'Mexico City of Course!',
|
9198 |
-
]
|
9199 |
-
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
|
9200 |
-
|
9201 |
-
// Compute similarity scores
|
9202 |
-
const [source_embeddings, ...document_embeddings ] = output.tolist();
|
9203 |
-
const similarities = document_embeddings.map(x => dot(source_embeddings, x));
|
9204 |
-
console.log(similarities); // [0.32719788157046004, 0.06960141111667434]
|
9205 |
-
```
|
9206 |
-
|
9207 |
-
|
9208 |
-
## Contact
|
9209 |
-
|
9210 |
-
|
9211 |
-
Feel free to open an issue or pull request if you have any questions or suggestions about this project.
|
9212 |
-
You also can email Daniel Campos([email protected]).
|
9213 |
-
|
9214 |
|
9215 |
-
|
9216 |
-
|
|
|
|
9044 |
task:
|
9045 |
type: PairClassification
|
9046 |
---
|
9047 |
+
A modified version of [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0), without xformers, so it works on CPU.
|
9048 |
|
9049 |
```python
|
9050 |
from sentence_transformers import SentenceTransformer
|
9051 |
import torch
|
9052 |
|
9053 |
+
device = torch.device("cpu")
|
9054 |
+
model = SentenceTransformer("cnmoro/snowflake-arctic-embed-m-v2.0-cpu", device=device, trust_remote_code=True)
|
9055 |
+
```
|
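A minimal usage sketch for the CPU variant loaded in the new README, mirroring the query/document example from the removed upstream text. It assumes this repository keeps the upstream `query` prompt configuration and that the installed Sentence Transformers version provides `model.similarity`; neither is confirmed by the commit itself. The names `rank_documents` and `run_demo` are hypothetical helpers introduced only for illustration.

```python
def rank_documents(scores, documents):
    """Pair each document with its score and sort from best to worst match."""
    pairs = zip(documents, (float(score) for score in scores))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def run_demo():
    # Imported here so the ranking helper stays usable without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "cnmoro/snowflake-arctic-embed-m-v2.0-cpu",
        device="cpu",
        trust_remote_code=True,
    )

    queries = ["what is snowflake?", "Where can I get the best tacos?"]
    documents = ["The Data Cloud!", "Mexico City of Course!"]

    # Assumption: the "query" prompt from the upstream model is still defined.
    query_embeddings = model.encode(queries, prompt_name="query")
    document_embeddings = model.encode(documents)

    # One row of similarity scores per query, one column per document.
    scores = model.similarity(query_embeddings, document_embeddings)

    for query, query_scores in zip(queries, scores):
        print("Query:", query)
        for document, score in rank_documents(query_scores, documents):
            print(f"{score:.4f}", document)
```

Calling `run_demo()` downloads the model on first use and prints the documents for each query in descending order of similarity. Keeping the ranking step in a separate function makes it possible to check the sorting logic without loading the model.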