jupyterjazz committed
Commit 3468cf0 · verified · 1 Parent(s): e50debb

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -134,18 +134,18 @@ Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to g
 ### Key Features:
 - **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
 - **Task-Specific Embedding:** Customize embeddings through the `task_type` argument with the following options:
- - `retrieval.query`: Query encoding for asymmetric retrieval tasks
- - `retrieval.passage`: Passage encoding for asymmetric retrieval tasks
- - `separation`: For clustering and re-ranking applications
- - `classification`: For classification tasks
- - `text-matching`: For measuring textual similarity
+ - `retrieval.query`: Used for query embeddings in asymmetric retrieval tasks
+ - `retrieval.passage`: Used for passage embeddings in asymmetric retrieval tasks
+ - `separation`: Used for embeddings in clustering and re-ranking applications
+ - `classification`: Used for embeddings in classification tasks
+ - `text-matching`: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
 - **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
 
 ### Model Lineage:
 
 `jina-embeddings-v3` builds upon the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, which was originally trained on 100 languages.
 We extended its capabilities with an extra pretraining phase on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset,
-then contrastively fine-tuned it on 30 languages for enhanced performance in both monolingual and cross-lingual setups.
+then contrastively fine-tuned it on 30 languages for enhanced performance on embedding tasks in both monolingual and cross-lingual setups.
 
 ### Supported Languages:
 While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages to maximize performance: