cnmoro committed on
Commit 47bf0e9 · verified · 1 Parent(s): 541365d

Upload README.md

Files changed (1):
  1. README.md +4 -165
README.md CHANGED
@@ -9044,173 +9044,12 @@ model-index:
   task:
     type: PairClassification
 ---
-<h1 align="center">Snowflake's Arctic-embed-m-v2.0</h1>
-<h4 align="center">
-  <p>
-    <a href=#news>News</a> |
-    <a href=#models>Models</a> |
-    <a href=#usage>Usage</a> |
-    <a href="#evaluation">Evaluation</a> |
-    <a href="#contact">Contact</a> |
-    <a href="#faq">FAQ</a> |
-    <a href="#license">License</a> |
-    <a href="#acknowledgement">Acknowledgement</a>
-  </p>
-</h4>
-
-<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=d5cb84e7-4b3a-4d82-85a1-19ec3721c447" />
-
-## News
-- 12/11/2024: Release of the [Technical Report](https://arxiv.org/abs/2412.04506).
-- 12/04/2024: Release of [snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) and [snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0), our newest models built with multilingual workloads in mind.
-
-
-## Models
-Snowflake arctic-embed-m-v2.0 is the newest addition to the suite of embedding models Snowflake has released, optimized for retrieval performance and inference efficiency.
-Arctic Embed 2.0 sets a new standard for multilingual embedding models, delivering high-quality multilingual text retrieval without sacrificing performance in English.
-Released under the permissive Apache 2.0 license, Arctic Embed 2.0 is ideal for applications that demand reliable, enterprise-grade multilingual search and retrieval at scale.
-
-Key Features:
-
-1. Multilingual without compromise: Excels in English and non-English retrieval, outperforming leading open-source and proprietary models on benchmarks like MTEB Retrieval, CLEF, and MIRACL.
-
-2. Inference efficiency: With only 113M non-embedding parameters, inference is fast and efficient at any scale.
-
-3. Compression-friendly: Achieves high-quality retrieval with embeddings as small as 128 bytes/vector using Matryoshka Representation Learning (MRL) and quantization-aware embedding training.
-
-4. Long context support: arctic-embed-m-v2.0 builds on [GTE-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), which can support a context window of up to 8192 tokens via the use of RoPE.
-
-
-### Quality Benchmarks
-Unlike most other open-source models, Arctic-embed-m-v2.0 excels across both English (via MTEB Retrieval) and multilingual (via MIRACL and CLEF) retrieval.
-You no longer need to maintain separate models to achieve high-quality English and multilingual retrieval. All numbers below are average NDCG@10 across the benchmark in question.
-
-| Model Name | # params | # non-emb params | # dimensions | BEIR (15) | MIRACL (4) | CLEF (Focused) | CLEF (Full) |
-|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| **snowflake-arctic-m-v2.0** | 305M | 113M | 768 | **55.4** | 55.2 | **51.7** | **53.9** |
-| snowflake-arctic-m | 109M | 86M | 768 | 54.9 | 24.9 | 34.4 | 29.1 |
-| me5 base | 560M | 303M | 1024 | 51.4 | 54.0 | 43.0 | 34.6 |
-| bge-m3 (BAAI) | 568M | 303M | 1024 | 48.8 | **56.8** | 40.8 | 41.3 |
-| gte (Alibaba) | 305M | 113M | 768 | 51.1 | 52.3 | 47.7 | 53.1 |
-
-Aside from high-quality retrieval, Arctic delivers embeddings that are easily compressible: vector truncation via MRL decreases vector size by 3x with only about 3% degradation in quality.
-Combining MRL-truncated vectors with int4 quantization powers retrieval at just 128 bytes per document (see the sketch at the end of this page).
-
-| Model | # dimensions | BEIR (15) | Relative Performance | MIRACL (4) | Relative Performance | CLEF (Focused) | Relative Performance | CLEF (Full) | Relative Performance |
-|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| snowflake-arctic-m-v2.0 | 768 | 55.4 | N/A | 55.2 | N/A | 51.7 | N/A | 53.9 | N/A |
-| snowflake-arctic-m-v2.0 | 256 | 54.4 | -1.81% | 54.0 | -2.17% | 50.6 | -2.13% | 52.3 | -3.06% |
-
-## Usage
-
-### Using Sentence Transformers
+A modified version of [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0), without xformers, so it works on CPU.
 
 ```python
 from sentence_transformers import SentenceTransformer
-
-# Load the model
-model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
-model = SentenceTransformer(model_name, trust_remote_code=True)
-
-# Define the queries and documents
-queries = ['what is snowflake?', 'Where can I get the best tacos?']
-documents = ['The Data Cloud!', 'Mexico City of Course!']
-
-# Compute embeddings: use `prompt_name="query"` to encode queries!
-query_embeddings = model.encode(queries, prompt_name="query")
-document_embeddings = model.encode(documents)
-
-# Compute cosine similarity scores
-scores = model.similarity(query_embeddings, document_embeddings)
-
-# Output the results
-for query, query_scores in zip(queries, scores):
-    doc_score_pairs = list(zip(documents, query_scores))
-    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
-    print("Query:", query)
-    for document, score in doc_score_pairs:
-        print(score, document)
-
-```
-
-### Using Hugging Face Transformers
-
-You can use the transformers package with Snowflake's arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion, and apply the query prefix (only on queries).
-
-```python
 import torch
-from transformers import AutoModel, AutoTokenizer
-
-model_name = 'Snowflake/snowflake-arctic-embed-m-v2.0'
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModel.from_pretrained(model_name, add_pooling_layer=False, trust_remote_code=True)
-model.eval()
-
-query_prefix = 'query: '
-queries = ['what is snowflake?', 'Where can I get the best tacos?']
-queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
-query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)
-
-documents = ['The Data Cloud!', 'Mexico City of Course!']
-document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)
-
-# Compute token embeddings (the CLS token is the first position)
-with torch.no_grad():
-    query_embeddings = model(**query_tokens)[0][:, 0]
-    document_embeddings = model(**document_tokens)[0][:, 0]
-
-# Normalize embeddings
-query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
-document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)
-
-scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
-for query, query_scores in zip(queries, scores):
-    doc_score_pairs = list(zip(documents, query_scores))
-    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
-    # Output passages & scores
-    print("Query:", query)
-    for document, score in doc_score_pairs:
-        print(score, document)
-```
-
-### Using Hugging Face Transformers.js
-
-If you haven't already, you can install the [Transformers.js](https://huggingface.co/docs/transformers.js) JavaScript library from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
-```bash
-npm i @huggingface/transformers
-```
-
-You can then use the model for retrieval, as follows:
-
-```js
-import { pipeline, dot } from '@huggingface/transformers';
-
-// Create feature extraction pipeline
-const extractor = await pipeline('feature-extraction', 'Snowflake/snowflake-arctic-embed-m-v2.0');
-
-// Generate sentence embeddings
-const sentences = [
-    'query: what is snowflake?',
-    'The Data Cloud!',
-    'Mexico City of Course!',
-];
-const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
-
-// Compute similarity scores
-const [source_embeddings, ...document_embeddings] = output.tolist();
-const similarities = document_embeddings.map(x => dot(source_embeddings, x));
-console.log(similarities); // [0.32719788157046004, 0.06960141111667434]
-```
-
-
-## Contact
-
-Feel free to open an issue or pull request if you have any questions or suggestions about this project.
-You can also email Daniel Campos ([email protected]).
-
 
-## License
-Arctic is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The released models can be used for commercial purposes free of charge.
+device = torch.device("cpu")
+model = SentenceTransformer("cnmoro/snowflake-arctic-embed-m-v2.0-cpu", device=device, trust_remote_code=True)
+```
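
For reference, end-to-end retrieval with the modified model follows the same pattern as the Sentence Transformers example removed above. A minimal sketch, assuming the CPU build keeps the original model's `query` prompt and pooling behavior:

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the CPU-only build (no xformers required)
device = torch.device("cpu")
model = SentenceTransformer("cnmoro/snowflake-arctic-embed-m-v2.0-cpu", device=device, trust_remote_code=True)

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

# As in the original model card: use `prompt_name="query"` for queries only
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Cosine similarity of every query against every document
scores = model.similarity(query_embeddings, document_embeddings)
for query, query_scores in zip(queries, scores):
    ranked = sorted(zip(documents, query_scores), key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in ranked:
        print(float(score), document)
```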
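The compression path described in the removed model card (MRL truncation plus int4, down to 128 bytes per vector) can be exercised through the standard sentence-transformers `truncate_dim` option. A sketch, assuming the CPU build inherits the base model's MRL-trained weights:

```python
import torch
from sentence_transformers import SentenceTransformer

# Truncate 768-dim embeddings to their first 256 dimensions (MRL)
model = SentenceTransformer(
    "cnmoro/snowflake-arctic-embed-m-v2.0-cpu",
    device=torch.device("cpu"),
    trust_remote_code=True,
    truncate_dim=256,
)

documents = ['The Data Cloud!', 'Mexico City of Course!']
# Re-normalize after truncation so cosine similarity stays meaningful
embeddings = model.encode(documents, normalize_embeddings=True)
print(embeddings.shape)  # (2, 256) rather than (2, 768)

# Int4 quantization of a 256-dim vector costs 256 * 4 bits = 128 bytes;
# that step is typically applied downstream, e.g. by the vector store.
```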