dtamayo committed · Commit 8977e3b · verified · Parent(s): aef8e21

Update README.md

Files changed (1):
  1. README.md +2 -2
README.md CHANGED
@@ -225,8 +225,8 @@ This adjustment resulted in a total of 2.68 trillion tokens, distributed as outl
 
 ![lang distrib](./images/corpus_languages_1.1.png)
 
-The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53,05% of the total tokens.
-Following this, Starcoder provides 13,67%, and FineWeb-Edu (350BT subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
+The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing around 1.72% to 1.41%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
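
The corrected percentages can be sanity-checked against the 2.68 trillion-token total stated in the README. Below is a minimal, hypothetical Python sketch (not part of this repository); the variable names are illustrative, and the shares are taken verbatim from the updated paragraph above.

```python
# Hypothetical sketch: convert the percentage shares quoted in the README
# into approximate absolute token counts, given the stated 2.68T total.
TOTAL_TOKENS = 2.68e12  # total pretraining tokens stated in the README

# Shares (in percent) for the largest sources, as listed in the updated text.
shares = {
    "Colossal OSCAR": 53.05,
    "Starcoder": 13.67,
    "FineWeb-Edu (350BT subset)": 10.24,
    "HPLT": 4.21,
    "French-PD": 3.59,
}

for source, pct in shares.items():
    tokens = TOTAL_TOKENS * pct / 100  # approximate token count for this source
    print(f"{source}: ~{tokens / 1e9:.0f}B tokens ({pct}%)")
```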