fdelucaf committed on
Commit 8e1d69f · verified · 1 Parent(s): f075c42

Add data sources links

Files changed (1)
  1. README.md +32 -35
README.md CHANGED
@@ -245,30 +245,30 @@ Click the expand button below to see the full list of corpora included in the tr
245
 
246
  | Dataset | Ca-xx Languages | Es-xx Languages |
247
  |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|
248
- |CCMatrix |eu | |
249
- |DGT | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
250
- |ELRC-EMEA | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl |
251
- |EMEA | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv |
252
- |EUBookshop |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |
253
- |Europarl | |bg,cs,da,el,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv |
254
- |Europt | |hr |
255
- |KDE4 |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |
256
- |Global Voices | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt |
257
- |GNOME |eu,fr,ga,gl,pt |ga |
258
- |JRC-Arquis | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv|
259
- |MultiCCAligned |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |
260
- |MultiHPLT |et,fi,ga,hr,mt | |
261
- |MultiParaCrawl |bg,da |de,fr,ga,hr,hu,it,mt,pt | |
262
- |MultiUN | |fr | |
263
- |News Commentary | |fr |
264
- |NLLB |bg,da,el,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv|
265
- |NTEU | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
266
- |OpenSubtitles |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl |
267
- |Tatoeba |de,pt |pt |
268
- |TildeModel | |bg |
269
- |UNPC | |fr |
270
- |WikiMatrix |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,fr,hr,it,pt |
271
- |XLENT |eu,ga,gl |ga |
272
 
273
 
274
 
@@ -292,9 +292,6 @@ The dataset has been created by the Machine Translation sub-group of the Languag
292
  Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
293
  and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
294
 
295
- However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
296
- and public institutions, which can be found in detail in the acknowledgements.
297
-
298
  **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
299
 
300
  This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
@@ -496,13 +493,13 @@ The dataset does not allow for external contributions.
496
  - Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., Linde, J. van der, Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
497
  - Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
498
  - Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
499
- - Rozis, R.,Skadiņš, R (2017). Tilde MODEL - Multilingual Open Data for EU Languages.
500
  - Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
501
  - Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
502
- - Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (n.d.). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages.
503
  - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
504
- - Tiedemann, J. (23-25). Parallel Data, Tools and Interfaces in OPUS. In N. C. (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA).
505
- - Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (n.d.). The United Nations Parallel Corpus v1.0.
506
 
507
 
508
 
@@ -917,12 +914,12 @@ Below are the evaluation results compared to [Apertium](https://www.apertium.org
917
 
918
  ### Aragonese Flores+ dev
919
 
920
- Below are the evaluation results on compared to [Apertium](https://www.apertium.org/), [Softcatala](https://www.softcatala.org/traductor/) and [Traduze](https://traduze.aragon.es).
921
 
922
  | | source | target | Bleu | ChrF |
923
  |:-----------------------|:---------|:---------|-------:|-------:|
924
  | Apertium | es | an | **65.34** | **82.00** |
925
- | Softcatala | es | an | 50.21 | 73.97 |
926
  | SalamandraTA-2B | es | an | 49.13 | 74.22 |
927
  | Traduze | es | an | 37.43 | 69.51 |
928
  | | | | | | | | | |
@@ -932,13 +929,13 @@ Below are the evaluation results on compared to [Apertium](https://www.apertium.
932
 
933
  ### Aranese Flores+ dev
934
 
935
- Below are the evaluation results on compared to [Apertium](https://www.apertium.org/) and [Softcatala](https://www.softcatala.org/traductor/).
936
 
937
 
938
  | | source | target | Bleu | ChrF |
939
  |:-----------------------|:---------|:---------|-------:|-------:|
940
  | Apertium | es | arn | **48.96** | **72.63** |
941
- | Softcatala | es | arn | 34.43 | 58.61 |
942
  | SalamandraTA-2B | es | arn | 34.35 | 57.78 |
943
  | | | | | | | | | |
944
  | | | | | | | | | |
 
245
 
246
  | Dataset | Ca-xx Languages | Es-xx Languages |
247
  |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|
248
+ |[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | |
249
+ |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
250
+ |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl |
251
+ |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv |
252
+ |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |
253
+ |[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,fi,fr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |
254
+ |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |hr |
255
+ |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |
256
+ |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt |
257
+ |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |
258
+ |[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl,ro,sv |
259
+ |[MultiCCAligned](https://opus.nlpl.eu/MultiCCAligned/corpus/version/MultiCCAligned) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |
260
+ |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |et,fi,ga,hr,mt | |
261
+ |[MultiParaCrawl](https://opus.nlpl.eu/MultiParaCrawl/corpus/version/MultiParaCrawl) |bg,da |de,fr,ga,hr,hu,it,mt,pt |
262
+ |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr |
263
+ |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr |
264
+ |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,et,fi,fr,gl,hu,it,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el,et,fi,fr,hu,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |
265
+ |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
266
+ |[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl |
267
+ |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt |
268
+ |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg |
269
+ |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |fr |
270
+ |[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,fr,hr,it,pt |
271
+ |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |
272
 
273
 
274
 
 
292
  Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
293
  and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
294
 
 
 
 
295
  **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
296
 
297
  This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
 
493
  - Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., Linde, J. van der, Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
494
  - Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
495
  - Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
496
+ - Rozis, R., & Skadiņš, R. (2017). Tilde MODEL - Multilingual Open Data for EU Languages. https://aclanthology.org/W17-0235
497
  - Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
498
  - Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
499
+ - Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. http://www.lrec-conf.org/proceedings/lrec2006/pdf/340_pdf
500
  - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
501
+ - Tiedemann, J. (2012). Parallel Data, Tools and Interfaces in OPUS. In N. C. (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper
502
+ - Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations Parallel Corpus v1.0. https://aclanthology.org/L16-1561
503
 
504
 
505
 
 
914
 
915
  ### Aragonese Flores+ dev
916
 
917
+ Below are the evaluation results compared to [Apertium](https://www.apertium.org/), [Softcatalà](https://www.softcatala.org/traductor/) and [Traduze](https://traduze.aragon.es).
918
 
919
  | | source | target | Bleu | ChrF |
920
  |:-----------------------|:---------|:---------|-------:|-------:|
921
  | Apertium | es | an | **65.34** | **82.00** |
922
+ | Softcatalà | es | an | 50.21 | 73.97 |
923
  | SalamandraTA-2B | es | an | 49.13 | 74.22 |
924
  | Traduze | es | an | 37.43 | 69.51 |
925
  | | | | | | | | | |
 
929
 
930
  ### Aranese Flores+ dev
931
 
932
+ Below are the evaluation results compared to [Apertium](https://www.apertium.org/) and [Softcatalà](https://www.softcatala.org/traductor/).
933
 
934
 
935
  | | source | target | Bleu | ChrF |
936
  |:-----------------------|:---------|:---------|-------:|-------:|
937
  | Apertium | es | arn | **48.96** | **72.63** |
938
+ | Softcatalà | es | arn | 34.43 | 58.61 |
939
  | SalamandraTA-2B | es | arn | 34.35 | 57.78 |
940
  | | | | | | | | | |
941
  | | | | | | | | | |