Update README.md
…and the rest of the languages were kept as is, resulting in the following distribution.

This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 66.06% of the total tokens. Following this, Starcoder provides 11.91% and Spanish Crawling adds 3.34%. The next largest sources are French PD at 3.12% and Proof Pile at 1.98%. Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing between 1.3% and 1.5%. These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model; the remaining 10% comes from smaller sources in various languages.

Feel free to click the expand button below to see the full list of sources.

| Source | Languages | Reference |
|--------|-----------|-----------|
| Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
| Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
| Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
| Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
| OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
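As a quick sanity check, the shares quoted above can be summed to confirm that roughly 10% is left over for the smaller sources. This is only a sketch: the values used for Macocu, Pile of Law, and Eurlex are approximations of the "around 1.3% to 1.5%" range given in the text, not exact figures.

```python
# Sanity-check the corpus composition percentages quoted above.
# Macocu / Pile of Law / Eurlex values are approximations of the
# "around 1.3% to 1.5%" range stated in the text, not exact figures.
shares = {
    "Colossal OSCAR": 66.06,
    "Starcoder": 11.91,
    "Spanish Crawling": 3.34,
    "French PD": 3.12,
    "Proof Pile": 1.98,
    "Macocu": 1.5,       # approximate
    "Pile of Law": 1.4,  # approximate
    "Eurlex": 1.3,       # approximate
}

named = sum(shares.values())
remaining = 100.0 - named
print(f"named sources: {named:.2f}% of tokens")      # ~90.61%
print(f"smaller sources: {remaining:.2f}% of tokens")  # ~9.39%
```

The leftover comes out near 9.4%, consistent with the statement that "the remaining 10% comes from smaller sources in various languages."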