izhx committed on
Commit 75b53a6 · verified · 1 Parent(s): a0d6174

Update README.md

Files changed (1)
  1. README.md +13 -9
README.md CHANGED
@@ -2624,7 +2624,8 @@ a SOTA instruction-tuned multi-lingual embedding model that ranked 2nd in MTEB a
 
 - **Developed by:** Institute for Intelligent Computing, Alibaba Group
 - **Model type:** Text Embeddings
-- **Paper:** Coming soon.
+- **Paper:** [mGTE: Generalized Long-Context Text Representation and Reranking
+  Models for Multilingual Text Retrieval](https://arxiv.org/pdf/2407.19669)
 
 <!-- - **Demo [optional]:** [More Information Needed] -->
 
@@ -2719,7 +2720,7 @@ console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979
 ### Training Data
 
 - Masked language modeling (MLM): `c4-en`
-- Weak-supervised contrastive (WSC) pre-training: [GTE](https://arxiv.org/pdf/2308.03281.pdf) pre-training data
+- Weak-supervised contrastive pre-training (CPT): [GTE](https://arxiv.org/pdf/2308.03281.pdf) pre-training data
 - Supervised contrastive fine-tuning: GTE(https://arxiv.org/pdf/2308.03281.pdf) fine-tuning data
 
 ### Training Procedure
@@ -2731,8 +2732,8 @@ And then, we resample the data, reducing the proportion of short texts, and cont
 The entire training process is as follows:
 - MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
 - MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
-- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
-- WSC: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
+- [MLM-8192](https://huggingface.co/Alibaba-NLP/gte-en-mlm-large): lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
+- CPT: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
 - Fine-tuning: TODO
 
 
@@ -2770,10 +2771,13 @@ The gte evaluation setting: `mteb==1.2.0, fp16 auto mix precision, max_length=81
 If you find our paper or models helpful, please consider citing them as follows:
 
 ```
-@article{li2023towards,
-  title={Towards general text embeddings with multi-stage contrastive learning},
-  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
-  journal={arXiv preprint arXiv:2308.03281},
-  year={2023}
+@misc{zhang2024mgte,
+  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
+  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
+  year={2024},
+  eprint={2407.19669},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2407.19669},
 }
 ```
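
Side note on the Training Procedure hunk above: the stage names correspond to fairly standard masked-language-modeling settings. Below is a minimal, hypothetical sketch (not part of this commit and not the authors' training code) of how the MLM-512 hyperparameters could be expressed with Hugging Face `transformers`. The placeholder backbone, the C4 column names, and the split of the 4096 effective batch size into per-device batch × gradient accumulation × GPUs are assumptions.

```python
# Hypothetical sketch only -- not the authors' training code.
# Maps the "MLM-512" line (lr 2e-4, mlm_probability 0.3, batch_size 4096,
# num_steps 300000) onto a generic Hugging Face MLM setup.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"  # placeholder backbone, not the actual architecture
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# `c4-en` corpus; streamed so nothing is downloaded up front
raw = load_dataset("allenai/c4", "en", split="train", streaming=True)

def tokenize(batch):
    # max_length 512 for this stage; later stages raise it to 2048 and 8192
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text", "timestamp", "url"])

# 30% of tokens are masked, matching mlm_probability 0.3
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

# batch_size 4096 is read here as the *effective* batch size:
# per_device_train_batch_size * gradient_accumulation_steps * num_gpus (assumption)
args = TrainingArguments(
    output_dir="mlm-512",            # placeholder output path
    learning_rate=2e-4,              # lr 2e-4
    max_steps=300_000,               # num_steps 300000
    per_device_train_batch_size=32,  # example value
    gradient_accumulation_steps=16,  # example value; scale both to reach 4096
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator)
# trainer.train()
```

The MLM-2048 and MLM-8192 stages would follow the same pattern with a longer `max_length`, lr 5e-5, and a larger RoPE base (an architecture setting, not a `TrainingArguments` field); the CPT and fine-tuning stages use contrastive objectives and are not sketched here.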