Add missing links

README.md CHANGED

@@ -232,7 +232,7 @@ print("Generated Translations:", results_detokenized)
 The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
 Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
 
-This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora.
+This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/) and Project Aina’s existing corpora.
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
 
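The synthetic-data step above amounts to machine-translating the Spanish side of each Spanish <-> xx corpus into Catalan. As a rough illustration, the sketch below follows the CTranslate2 + SentencePiece usage pattern shown on Projecte Aina’s translation model cards; the `spm.model` path and the example sentences are assumptions, not the exact pipeline used to build this corpus.

```python
# Hedged sketch: produce synthetic Catalan from the Spanish side of a corpus
# with projecte-aina/aina-translator-es-ca. Assumes the CTranslate2 +
# SentencePiece layout used by Projecte Aina releases (spm.model in the repo
# root); the example sentences are illustrative only.
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-es-ca", revision="main")
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
translator = ctranslate2.Translator(model_dir)

spanish_side = ["El corpus contiene pares de frases paralelas.", "La casa es azul."]

# Tokenize, translate, then detokenize to obtain the synthetic Catalan side.
tokens, _ = tokenizer.tokenize_batch(spanish_side)
translations = translator.translate_batch(tokens)
synthetic_catalan = tokenizer.detokenize_batch([t.hypotheses[0] for t in translations])
print("Synthetic Catalan:", synthetic_catalan)
```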
@@ -290,7 +290,7 @@ The purpose of creating this dataset is to pre-train multilingual models on para
 
 The dataset has been created by the Machine Translation sub-group of the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
 Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
-and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca Fornaciari.
+and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
 
 However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
 and public institutions, which can be found in detail in the acknowledgements.
@@ -305,8 +305,8 @@ This work/research has been promoted and financed by the Government of Catalonia
 
 The dataset consists entirely of parallel text separated at sentence level. Specifically, data was mainly sourced from the following databases and
 repositories:
-- **Opus:** Repository which aims to provide freely available parallel datasets in order to advance work in computational linguistics and automatic translation.
-- **ELRC:** Repository used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination
+- **[Opus](https://opus.nlpl.eu/):** Repository which aims to provide freely available parallel datasets in order to advance work in computational linguistics and automatic translation.
+- **[ELRC-SHARE](https://www.elrc-share.eu/):** Repository used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination.
 
 **How many instances are there in total (of each type, if appropriate)?**
 
@@ -409,8 +409,8 @@ ethical and legal point of view, respectively.
 **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
 All data was filtered according to two specific criteria:
-- Alignment - sentence level alignments were calculated using LaBSE and sentence pairs with a score below 0.75 were discarded.
-- Language identification - The probability of being the target language was calculated using either
+- Alignment - sentence level alignments were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and sentence pairs with a score below 0.75 were discarded.
+- Language identification - The probability of being the target language was calculated using either [Idiomata Cognitor](https://github.com/transducens/idiomata_cognitor) or [Lingua.py](https://github.com/pemistahl/lingua-py) and sentences identified as unlikely to be the correct language were filtered out. Thresholds varied by language.
 
 **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
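To make the two filtering criteria concrete, the sketch below combines a LaBSE similarity cut-off with a Lingua.py confidence check. Only the 0.75 alignment threshold comes from the text above; the 0.5 language-confidence threshold, the language set, and the choice of Lingua.py over Idiomata Cognitor are illustrative assumptions (the card itself notes that thresholds varied by language).

```python
# Hedged sketch of the two filters described above. The 0.75 alignment cut-off
# is stated in the card; the 0.5 language-confidence threshold is a placeholder,
# since thresholds varied by language.
from lingua import Language, LanguageDetectorBuilder
from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("sentence-transformers/LaBSE")
detector = LanguageDetectorBuilder.from_languages(
    Language.SPANISH, Language.CATALAN, Language.ENGLISH
).build()

def keep_pair(src, tgt, tgt_lang=Language.CATALAN, align_threshold=0.75, lang_threshold=0.5):
    # Alignment: dot product of L2-normalised LaBSE embeddings, i.e. cosine similarity.
    src_emb, tgt_emb = labse.encode([src, tgt], normalize_embeddings=True)
    if float(src_emb @ tgt_emb) < align_threshold:
        return False
    # Language identification: confidence that the target side is the expected language.
    return detector.compute_language_confidence(tgt, tgt_lang) >= lang_threshold

pairs = [("La casa es azul.", "La casa és blava."), ("Buenos días.", "Hello there.")]
print([p for p in pairs if keep_pair(*p)])  # keeps only the aligned Spanish-Catalan pair
```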