BSC-LT/salamandraTA-2B

@@ -245,30 +245,30 @@ Click the expand button below to see the full list of corpora included in the tr
 | Dataset                                   	| Ca-xx Languages                                                                                                	|  Es-xx Languages                              |
 |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|
-|CCMatrix		|eu			|		|
-|DGT			|			|bg,cs,da,de,el	,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv	|
-|ELRC-EMEA		|			|bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl		|
-|EMEA			|			|bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv		|
-|EUBookshop		|lt,pl,pt			|cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv		|
-|Europarl		|			|bg,cs,da,el,fi,fr,hu,lt,lv,nl,pl,pt	,ro,sk,sl,sv	|
-|Europt		|			|hr		|
-|KDE4			|bg,cs,da,de,el	,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv	|bg,ga,hr	|
-|Global Voices		| bg,de,fr,it,nl,pl,pt	|bg,de,fr,pt		|
-|GNOME		|eu,fr,ga,gl,pt		|ga		|
-|JRC-Arquis		|			|cs,da,et,fr,lt,lv,mt,nl,pl	,ro,sv|
-|MultiCCAligned	|bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv	|bg,fi,fr,hr,it,lv,nl,pt		|
-|MultiHPLT		|et,fi,ga,hr,mt		|		|
-|MultiParaCrawl	|bg,da		|de,fr,ga,hr,hu,it,mt,pt		|	|
-|MultiUN		|			|fr	|	|
-|News Commentary 	|		|fr		|
-|NLLB			|bg,da,el,et,fi,fr,gl,hu,it	,lt,lv,pt,ro,sk,sl	|bg,cs,da,de,el	,et,fi,fr,hu,it,lt,lv,nl,pl,pt	,ro,sk,sl,sv|
-|NTEU			|			|bg,cs,da,de,el	,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv	|
-|OpenSubtitles 	|bg,cs,da,de,el	,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv	|da,de,fi,fr,hr,hu,it,lv,nl		|
-|Tatoeba		|de,pt			|pt		|
-|TildeModel		|			|bg		|
-|UNPC			|			|fr		|
-|WikiMatrix		|bg,cs,da,de,el	,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv	|bg,fr,hr,it,pt		|
-|XLENT		|eu,ga,gl			|ga		|
@@ -292,9 +292,6 @@ The dataset has been created by the Machine Translation sub-group of the Languag
 Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
 and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
-However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
-and public institutions, which can be found in detail in the acknowledgements.
 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
@@ -496,13 +493,13 @@ The dataset does not allow for external contributions.
 - Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., Linde, J. van der, Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
 - Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
 - Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
-- Rozis, R.,Skadiņš, R (2017). Tilde MODEL - Multilingual Open Data for EU Languages.
 - Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
 - Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
-- Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (n.d.). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages.
 - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
-- Tiedemann, J. (23-25). Parallel Data, Tools and Interfaces in OPUS. In N. C. (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA).
-- Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (n.d.). The United Nations Parallel Corpus v1.0.
@@ -917,12 +914,12 @@ Below are the evaluation results compared to [Apertium](https://www.apertium.org
 ### Aragonese Flores+ dev
-Below are the evaluation results on compared to [Apertium](https://www.apertium.org/), [Softcatala](https://www.softcatala.org/traductor/) and [Traduze](https://traduze.aragon.es).
 |             | source   | target   |   Bleu |    ChrF |
 |:-----------------------|:---------|:---------|-------:|-------:|
 | Apertium | es       | an      |  **65.34** |  **82.00** |
-| Softcatala | es       | an      |  50.21 |  73.97 |
 | SalamandraTA-2B | es       | an      |  49.13 |  74.22 |
 | Traduze | es       | an      |  37.43 |  69.51 |
 |  | | | | | | | | |
@@ -932,13 +929,13 @@ Below are the evaluation results on compared to [Apertium](https://www.apertium.
 ### Aranese Flores+ dev
-Below are the evaluation results on compared to [Apertium](https://www.apertium.org/) and [Softcatala](https://www.softcatala.org/traductor/).
 |             | source   | target   |   Bleu |    ChrF |
 |:-----------------------|:---------|:---------|-------:|-------:|
 | Apertium | es       | arn      |  **48.96** |  **72.63** |
-| Softcatala | es       | arn      |  34.43 |  58.61 |
 | SalamandraTA-2B | es       | arn      |  34.35 |  57.78 |
 |  | | | | | | | | |
 |  | | | | | | | | |

 | Dataset                                   	| Ca-xx Languages                                                                                                	|  Es-xx Languages                              |
 |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|
+|[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) | eu | |
+|[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | | bg,cs,da,de,el,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
+|[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | | bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl |
+|[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | | bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv |
+|[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) | lt,pl,pt | cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |
+|[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | | bg,cs,da,el,fi,fr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |
+|[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | | hr |
+|[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) | bg,cs,da,de,el,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv | bg,ga,hr |
+|[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt | bg,de,fr,pt |
+|[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) | eu,fr,ga,gl,pt | ga |
+|[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | | cs,da,et,fr,lt,lv,mt,nl,pl,ro,sv |
+|[MultiCCAligned](https://opus.nlpl.eu/MultiCCAligned/corpus/version/MultiCCAligned) | bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv | bg,fi,fr,hr,it,lv,nl,pt |
+|[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) | et,fi,ga,hr,mt | |
+|[MultiParaCrawl](https://opus.nlpl.eu/MultiParaCrawl/corpus/version/MultiParaCrawl) | bg,da | de,fr,ga,hr,hu,it,mt,pt |
+|[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | | fr |
+|[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | | fr |
+|[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) | bg,da,el,et,fi,fr,gl,hu,it,lt,lv,pt,ro,sk,sl | bg,cs,da,de,el,et,fi,fr,hu,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |
+|[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | | bg,cs,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv |
+|[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) | bg,cs,da,de,el,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv | da,de,fi,fr,hr,hu,it,lv,nl |
+|[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) | de,pt | pt |
+|[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | | bg |
+|[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | | fr |
+|[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) | bg,cs,da,de,el,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv | bg,fr,hr,it,pt |
+|[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) | eu,ga,gl | ga |
 Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
 and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
 - Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., Linde, J. van der, Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
 - Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
 - Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
+- Rozis, R., & Skadiņš, R. (2017). Tilde MODEL - Multilingual Open Data for EU Languages. https://aclanthology.org/W17-0235
 - Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
 - Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
+- Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. http://www.lrec-conf.org/proceedings/lrec2006/pdf/340_pdf
 - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
+- Tiedemann, J. (2012). Parallel Data, Tools and Interfaces in OPUS. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper
+- Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations Parallel Corpus v1.0. https://aclanthology.org/L16-1561
 ### Aragonese Flores+ dev
+Below are the evaluation results compared to [Apertium](https://www.apertium.org/), [Softcatalà](https://www.softcatala.org/traductor/) and [Traduze](https://traduze.aragon.es).
 |             | source   | target   |   Bleu |    ChrF |
 |:-----------------------|:---------|:---------|-------:|-------:|
 | Apertium | es       | an      |  **65.34** |  **82.00** |
+| Softcatalà | es       | an      |  50.21 |  73.97 |
 | SalamandraTA-2B | es       | an      |  49.13 |  74.22 |
 | Traduze | es       | an      |  37.43 |  69.51 |
 |  | | | | | | | | |
 ### Aranese Flores+ dev
+Below are the evaluation results compared to [Apertium](https://www.apertium.org/) and [Softcatalà](https://www.softcatala.org/traductor/).
 |             | source   | target   |   Bleu |    ChrF |
 |:-----------------------|:---------|:---------|-------:|-------:|
 | Apertium | es       | arn      |  **48.96** |  **72.63** |
+| Softcatalà | es       | arn      |  34.43 |  58.61 |
 | SalamandraTA-2B | es       | arn      |  34.35 |  57.78 |
 |  | | | | | | | | |
 |  | | | | | | | | |