jsaizant committed on
Commit 68c47d4 · verified · 1 Parent(s): 0c36412

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -68,7 +68,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on 11.675 trillion tokens of highly curated data.
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -199,7 +199,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.08 trillion tokens, distributed as outlined below:
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
 
@@ -346,8 +346,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.08T tokens per epoch;
-and 1 final round of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 11.675 trillion tokens.
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
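The corrected totals in this commit follow from simple arithmetic over the epoch sizes quoted in the README. A minimal sanity check in Python (variable names are ours and purely illustrative; the figures, in trillions of tokens, are taken from the diff above):

```python
# Token counts in trillions, as quoted in the README diff above.
initial_epochs = 3 * 2.4   # three pre-training epochs at 2.4T tokens each
old_fineweb   = 2 * 2.08   # pre-commit figure: two epochs at 2.08T tokens each
new_fineweb   = 2 * 2.68   # post-commit figure: two epochs at 2.68T tokens each
final_round   = 0.315      # final round of higher-quality tokens

print(round(initial_epochs + old_fineweb + final_round, 3))  # 11.675 (old total)
print(round(initial_epochs + new_fineweb + final_round, 3))  # 12.875 (new total)
```

Both the per-epoch figure (2.08T to 2.68T) and the grand total (11.675T to 12.875T) change consistently, which is what the four edited lines reflect.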