izhx committed on
Commit 75b53a6 · verified · 1 Parent(s): a0d6174

Update README.md

Files changed (1)
  1. README.md +13 -9
README.md CHANGED
@@ -2624,7 +2624,8 @@ a SOTA instruction-tuned multi-lingual embedding model that ranked 2nd in MTEB a
 
 - **Developed by:** Institute for Intelligent Computing, Alibaba Group
 - **Model type:** Text Embeddings
-- **Paper:** Coming soon.
+- **Paper:** [mGTE: Generalized Long-Context Text Representation and Reranking
+  Models for Multilingual Text Retrieval](https://arxiv.org/pdf/2407.19669)
 
 <!-- - **Demo [optional]:** [More Information Needed] -->
 
@@ -2719,7 +2720,7 @@ console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979
 ### Training Data
 
 - Masked language modeling (MLM): `c4-en`
-- Weak-supervised contrastive (WSC) pre-training: [GTE](https://arxiv.org/pdf/2308.03281.pdf) pre-training data
+- Weak-supervised contrastive pre-training (CPT): [GTE](https://arxiv.org/pdf/2308.03281.pdf) pre-training data
 - Supervised contrastive fine-tuning: GTE(https://arxiv.org/pdf/2308.03281.pdf) fine-tuning data
 
 ### Training Procedure
@@ -2731,8 +2732,8 @@ And then, we resample the data, reducing the proportion of short texts, and cont
 The entire training process is as follows:
 - MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
 - MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
-- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
-- WSC: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
+- [MLM-8192](https://huggingface.co/Alibaba-NLP/gte-en-mlm-large): lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
+- CPT: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
 - Fine-tuning: TODO
 
 
@@ -2770,10 +2771,13 @@ The gte evaluation setting: `mteb==1.2.0, fp16 auto mix precision, max_length=81
 If you find our paper or models helpful, please consider citing them as follows:
 
 ```
-@article{li2023towards,
-  title={Towards general text embeddings with multi-stage contrastive learning},
-  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
-  journal={arXiv preprint arXiv:2308.03281},
-  year={2023}
+@misc{zhang2024mgte,
+  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
+  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
+  year={2024},
+  eprint={2407.19669},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2407.19669},
 }
 ```
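
Side note on the Training Procedure hunk above: the stage names correspond to fairly standard masked-language-modeling settings. Below is a minimal, hypothetical sketch (not part of this commit and not the authors' training code) of how the MLM-512 hyperparameters could be expressed with Hugging Face `transformers`. The placeholder backbone, the C4 column names, and the split of the 4096 effective batch size into per-device batch × gradient accumulation × GPUs are assumptions.

```python
# Hypothetical sketch only -- not the authors' training code.
# Maps the "MLM-512" line (lr 2e-4, mlm_probability 0.3, batch_size 4096,
# num_steps 300000) onto a generic Hugging Face MLM setup.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"  # placeholder backbone, not the actual architecture
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# `c4-en` corpus; streamed so nothing is downloaded up front
raw = load_dataset("allenai/c4", "en", split="train", streaming=True)

def tokenize(batch):
    # max_length 512 for this stage; later stages raise it to 2048 and 8192
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text", "timestamp", "url"])

# 30% of tokens are masked, matching mlm_probability 0.3
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

# batch_size 4096 is read here as the *effective* batch size:
# per_device_train_batch_size * gradient_accumulation_steps * num_gpus (assumption)
args = TrainingArguments(
    output_dir="mlm-512",            # placeholder output path
    learning_rate=2e-4,              # lr 2e-4
    max_steps=300_000,               # num_steps 300000
    per_device_train_batch_size=32,  # example value
    gradient_accumulation_steps=16,  # example value; scale both to reach 4096
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator)
# trainer.train()
```

The MLM-2048 and MLM-8192 stages would follow the same pattern with a longer `max_length`, lr 5e-5, and a larger RoPE base (an architecture setting, not a `TrainingArguments` field); the CPT and fine-tuning stages use contrastive objectives and are not sketched here.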