Update README.md
…and the rest of the languages were kept as is, resulting in the following distribution.

This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 66.06% of the total tokens. Following this, Starcoder provides 11.91% and Spanish Crawling adds 3.34%. The next largest sources are French PD at 3.12% and Proof Pile at 1.98%. Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing between 1.3% and 1.5%. These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model; the remaining 10% comes from smaller sources in various languages.

Feel free to click the expand button below to see the full list of sources.

| Source | Languages | Reference |
|--------|-----------|-----------|
| Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
| Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
| Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
| Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
| OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
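As a quick sanity check, the shares quoted above can be summed to confirm that roughly 10% is left over for the smaller sources. This is only a sketch: the values used for Macocu, Pile of Law, and Eurlex are approximations of the "around 1.3% to 1.5%" range given in the text, not exact figures.

```python
# Sanity-check the corpus composition percentages quoted above.
# Macocu / Pile of Law / Eurlex values are approximations of the
# "around 1.3% to 1.5%" range stated in the text, not exact figures.
shares = {
    "Colossal OSCAR": 66.06,
    "Starcoder": 11.91,
    "Spanish Crawling": 3.34,
    "French PD": 3.12,
    "Proof Pile": 1.98,
    "Macocu": 1.5,       # approximate
    "Pile of Law": 1.4,  # approximate
    "Eurlex": 1.3,       # approximate
}

named = sum(shares.values())
remaining = 100.0 - named
print(f"named sources: {named:.2f}% of tokens")      # ~90.61%
print(f"smaller sources: {remaining:.2f}% of tokens")  # ~9.39%
```

The leftover comes out near 9.4%, consistent with the statement that "the remaining 10% comes from smaller sources in various languages."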