Add data sources links
README.md
CHANGED
@@ -245,30 +245,30 @@ Click the expand button below to see the full list of corpora included in the tr
|
|
245 |
|
246 |
| Dataset | Ca-xx Languages | Es-xx Langugages |
|
247 |
|-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|
|
248 |
-
|CCMatrix |eu | |
|
249 |
-
|DGT | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
|
250 |
-
|ELRC-EMEA | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl |
|
251 |
-
|EMEA | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv |
|
252 |
-
|EUBookshop |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |
|
253 |
-
|Europarl | |bg,cs,da,el,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv |
|
254 |
-
|
|
255 |
-
|KDE4 |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |
|
256 |
-
|
|
257 |
-
|GNOME |eu,fr,ga,gl,pt |ga |
|
258 |
-
|JRC-Arquis | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv|
|
259 |
-
|MultiCCAligned |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |
|
260 |
-
|MultiHPLT |et,fi,ga,hr,mt | |
|
261 |
-
|MultiParaCrawl |bg,da |de,fr,ga,hr,hu,it,mt,pt | |
|
262 |
-
|MultiUN | |fr | |
|
263 |
-
|News
|
264 |
-
|NLLB |bg,da,el,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv|
|
265 |
-
|NTEU | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
|
266 |
-
|OpenSubtitles |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl |
|
267 |
-
|Tatoeba |de,pt |pt |
|
268 |
-
|TildeModel | |bg |
|
269 |
-
|UNPC | |fr |
|
270 |
-
|WikiMatrix |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,fr,hr,it,pt |
|
271 |
-
|XLENT |eu,ga,gl |ga |
|
272 |
|
273 |
|
274 |
|
@@ -292,9 +292,6 @@ The dataset has been created by the Machine Translation sub-group of the Languag
|
|
292 |
Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
|
293 |
and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
|
294 |
|
295 |
-
However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
|
296 |
-
and public institutions, which can be found in detail in the acknowledgements.
|
297 |
-
|
298 |
**Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
|
299 |
|
300 |
This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
|
@@ -496,13 +493,13 @@ The dataset does not allow for external contributions.
|
|
496 |
- Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., Linde, J. van der, Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
|
497 |
- Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
|
498 |
- Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
|
499 |
-
- Rozis, R.,Skadiņš, R (2017). Tilde MODEL - Multilingual Open Data for EU Languages.
|
500 |
- Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
|
501 |
- Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
|
502 |
-
- Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (n.d.). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages.
|
503 |
- Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
|
504 |
-
- Tiedemann, J. (23-25). Parallel Data, Tools and Interfaces in OPUS. In N. C. (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA).
|
505 |
-
- Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (n.d.). The United Nations Parallel Corpus v1.0.
|
506 |
|
507 |
|
508 |
|
@@ -917,12 +914,12 @@ Below are the evaluation results compared to [Apertium](https://www.apertium.org
|
|
917 |
|
918 |
### Aragonese Flores+ dev
|
919 |
|
920 |
-
Below are the evaluation results on compared to [Apertium](https://www.apertium.org/), [
|
921 |
|
922 |
| | source | target | Bleu | ChrF |
|
923 |
|:-----------------------|:---------|:---------|-------:|-------:|
|
924 |
| Apertium | es | an | **65.34** | **82.00** |
|
925 |
-
|
|
926 |
| SalamandraTA-2B | es | an | 49.13 | 74.22 |
|
927 |
| Traduze | es | an | 37.43 | 69.51 |
|
928 |
| | | | | | | | | |
|
@@ -932,13 +929,13 @@ Below are the evaluation results on compared to [Apertium](https://www.apertium.
|
|
932 |
|
933 |
### Aranese Flores+ dev
|
934 |
|
935 |
-
Below are the evaluation results on compared to [Apertium](https://www.apertium.org/) and [
|
936 |
|
937 |
|
938 |
| | source | target | Bleu | ChrF |
|
939 |
|:-----------------------|:---------|:---------|-------:|-------:|
|
940 |
| Apertium | es | arn | **48.96** | **72.63** |
|
941 |
-
|
|
942 |
| SalamandraTA-2B | es | arn | 34.35 | 57.78 |
|
943 |
| | | | | | | | | |
|
944 |
| | | | | | | | | |
|
|
|
245 |
|
246 |
| Dataset | Ca-xx Languages | Es-xx Langugages |
|
247 |
|-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|
|
248 |
+
|[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | |
|
249 |
+
|[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
|
250 |
+
|[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl |
|
251 |
+
|[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv |
|
252 |
+
|[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |
|
253 |
+
|[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv |
|
254 |
+
|[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |hr |
|
255 |
+
|[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |
|
256 |
+
|[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt |
|
257 |
+
|[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |
|
258 |
+
|[JRC-Arquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv|
|
259 |
+
|[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |
|
260 |
+
|[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |et,fi,ga,hr,mt | |
|
261 |
+
|[MultiParaCrawl](https://opus.nlpl.eu/MultiParaCrawl/corpus/version/MultiParaCrawl) |bg,da |de,fr,ga,hr,hu,it,mt,pt | |
|
262 |
+
|[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
|
263 |
+
|[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr |
|
264 |
+
|[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv|
|
265 |
+
|[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
|
266 |
+
|[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl |
|
267 |
+
|[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt |
|
268 |
+
|[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg |
|
269 |
+
|[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |fr |
|
270 |
+
|[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,fr,hr,it,pt |
|
271 |
+
|[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |
|
272 |
|
273 |
|
274 |
|
|
|
292 |
Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
|
293 |
and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
|
294 |
|
295 |
**Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
|
296 |
|
297 |
This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
|
|
|
493 |
- Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., Linde, J. van der, Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
|
494 |
- Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
|
495 |
- Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
|
496 |
+
- Rozis, R.,Skadiņš, R (2017). Tilde MODEL - Multilingual Open Data for EU Languages. https://aclanthology.org/W17-0235
|
497 |
- Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
|
498 |
- Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
|
499 |
+
- Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (n.d.). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. http://www.lrec-conf.org/proceedings/lrec2006/pdf/340_pdf
|
500 |
- Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
|
501 |
+
- Tiedemann, J. (23-25). Parallel Data, Tools and Interfaces in OPUS. In N. C. (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper
|
502 |
+
- Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (n.d.). The United Nations Parallel Corpus v1.0. https://aclanthology.org/L16-1561
|
503 |
|
504 |
|
505 |
|
|
|
914 |
|
915 |
### Aragonese Flores+ dev
|
916 |
|
917 |
+
Below are the evaluation results on compared to [Apertium](https://www.apertium.org/), [Softcatalà](https://www.softcatala.org/traductor/) and [Traduze](https://traduze.aragon.es).
|
918 |
|
919 |
| | source | target | Bleu | ChrF |
|
920 |
|:-----------------------|:---------|:---------|-------:|-------:|
|
921 |
| Apertium | es | an | **65.34** | **82.00** |
|
922 |
+
| Softcatalà | es | an | 50.21 | 73.97 |
|
923 |
| SalamandraTA-2B | es | an | 49.13 | 74.22 |
|
924 |
| Traduze | es | an | 37.43 | 69.51 |
|
925 |
| | | | | | | | | |
|
|
|
929 |
|
930 |
### Aranese Flores+ dev
|
931 |
|
932 |
+
Below are the evaluation results on compared to [Apertium](https://www.apertium.org/) and [Softcatalà](https://www.softcatala.org/traductor/).
|
933 |
|
934 |
|
935 |
| | source | target | Bleu | ChrF |
|
936 |
|:-----------------------|:---------|:---------|-------:|-------:|
|
937 |
| Apertium | es | arn | **48.96** | **72.63** |
|
938 |
+
| Softcatalà | es | arn | 34.43 | 58.61 |
|
939 |
| SalamandraTA-2B | es | arn | 34.35 | 57.78 |
|
940 |
| | | | | | | | | |
|
941 |
| | | | | | | | | |
|