Update README.md
README.md CHANGED
```diff
@@ -68,7 +68,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -199,7 +199,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
 
@@ -346,8 +346,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.
-and 1 final
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
```
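To make the corpus adjustment described in the second hunk concrete, below is a minimal, hypothetical Python sketch of that re-weighting: code and English shares are halved, the Spanish co-official languages (Spanish, Catalan, Galician, and Basque) are doubled, and the remaining languages are left at their original proportions. The language codes, the example shares, and the final renormalization step are illustrative assumptions, not values taken from the released training configurations.

```python
# Hypothetical sketch of the corpus re-weighting described in the second hunk:
# code and English are downsampled to half, Spain's co-official languages
# (Spanish, Catalan, Galician, Basque) are oversampled by 2x, and the remaining
# languages keep their original proportions. Language codes, the example shares,
# and the renormalization step are illustrative assumptions, not released configs.

CO_OFFICIAL = {"es", "ca", "gl", "eu"}   # Spanish, Catalan, Galician, Basque
DOWNSAMPLED = {"en", "code"}             # halved in the adjusted mix


def reweight(shares: dict) -> dict:
    """Apply the 0.5x / 2x / 1x factors and renormalize so the mix sums to 1."""
    adjusted = {}
    for name, share in shares.items():
        if name in DOWNSAMPLED:
            adjusted[name] = share * 0.5
        elif name in CO_OFFICIAL:
            adjusted[name] = share * 2.0
        else:
            adjusted[name] = share
    total = sum(adjusted.values())
    return {name: share / total for name, share in adjusted.items()}


if __name__ == "__main__":
    # Made-up proportions of the original 2.4T-token mix, for illustration only.
    example = {"en": 0.40, "code": 0.20, "es": 0.12, "ca": 0.03,
               "gl": 0.01, "eu": 0.01, "de": 0.13, "fr": 0.10}
    for name, share in sorted(reweight(example).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.3f}")
```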
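The token counts quoted in the last hunk are internally consistent: 3 epochs at 2.4T tokens, plus 2 epochs at 2.68T tokens after the FineWebEdu swap, plus a final 0.315T-token epoch add up to 12.875 trillion tokens, matching the figure added to the Description. A quick arithmetic check:

```python
# Quick check of the token counts in the last hunk (values in trillions of tokens):
# 3 epochs at 2.4T, 2 epochs at 2.68T after the FineWebEdu swap, 1 final epoch at 0.315T.
epochs = [(3, 2.4), (2, 2.68), (1, 0.315)]
total = sum(n * tokens for n, tokens in epochs)
print(f"{total:.3f}T")  # 12.875T, matching the figure added to the Description
```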