jsaizant committed on
Commit
961f7f1
1 Parent(s): 63951cf

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -203,7 +203,7 @@ and the rest of the languages were kept as is, resulting in the following distri
203
  This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
204
  which contributes a significant 66.06% of the total tokens.
205
  Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
206
- The next largest sources are French FR at 3.12% and Proof Pile at 1.98%.
207
  Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
208
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
209
  The remaining 10% comes from smaller sources in various languages.
@@ -217,7 +217,6 @@ Feel free to click the expand button below to see the full list of sources.
217
  |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
218
  | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
219
  | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
220
- | Crawl of Bulgarian news websites | bg | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z) |
221
  | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
222
  | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
223
  | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
 
203
  This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
204
  which contributes a significant 66.06% of the total tokens.
205
  Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
206
+ The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
207
  Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
208
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
209
  The remaining 10% comes from smaller sources in various languages.
 
217
  |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
218
  | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
219
  | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
 
220
  | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
221
  | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
222
  | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
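
As a quick sanity check on the distribution paragraph touched by this commit, the quoted shares can be summed to see whether "the remaining 10%" is consistent with them. This is only a sketch. The five exact shares come from the README text, but Macocu, Pile of Law, and Eurlex are described only as "around 1.5% to 1.3%", so the bounds below are an assumption, and the names `NAMED_SHARES`, `APPROX_RANGE`, and `remainder_bounds` are hypothetical helpers, not part of the repository.

```python
# Shares are stored in hundredths of a percent so the arithmetic stays in
# integers and avoids floating-point rounding noise.
NAMED_SHARES = {
    "Colossal OSCAR": 6606,    # 66.06%
    "Starcoder": 1191,         # 11.91%
    "Spanish Crawling": 334,   # 3.34%
    "French PD": 312,          # 3.12%
    "Proof Pile": 198,         # 1.98%
}

# Assumption: the README gives only a range for these three sources,
# so each is bounded between 1.30% and 1.50%.
APPROX_SOURCES = ("Macocu", "Pile of Law", "Eurlex")
APPROX_RANGE = (130, 150)


def remainder_bounds(named, approx_count, approx_range):
    """Return (min, max) share left for all other sources, in hundredths of a percent."""
    named_total = sum(named.values())
    low, high = approx_range
    smallest_remainder = 10000 - (named_total + approx_count * high)
    largest_remainder = 10000 - (named_total + approx_count * low)
    return smallest_remainder, largest_remainder


if __name__ == "__main__":
    lo, hi = remainder_bounds(NAMED_SHARES, len(APPROX_SOURCES), APPROX_RANGE)
    print(f"Remaining share: {lo / 100:.2f}% to {hi / 100:.2f}%")
```

Under these assumed bounds the remainder lands a little below 10%, which is compatible with the README's rounded figure.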