Add missing links

README.md CHANGED

@@ -232,7 +232,7 @@ print("Generated Translations:", results_detokenized)
 The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
 Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
 
-This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora.
+This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/) and Project Aina’s existing corpora.
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
 
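The synthetic-data step above amounts to machine-translating the Spanish side of each Spanish <-> xx corpus into Catalan. As a rough illustration, the sketch below follows the CTranslate2 + SentencePiece usage pattern shown on Projecte Aina’s translation model cards; the `spm.model` path and the example sentences are assumptions, not the exact pipeline used to build this corpus.

```python
# Hedged sketch: produce synthetic Catalan from the Spanish side of a corpus
# with projecte-aina/aina-translator-es-ca. Assumes the CTranslate2 +
# SentencePiece layout used by Projecte Aina releases (spm.model in the repo
# root); the example sentences are illustrative only.
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-es-ca", revision="main")
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
translator = ctranslate2.Translator(model_dir)

spanish_side = ["El corpus contiene pares de frases paralelas.", "La casa es azul."]

# Tokenize, translate, then detokenize to obtain the synthetic Catalan side.
tokens, _ = tokenizer.tokenize_batch(spanish_side)
translations = translator.translate_batch(tokens)
synthetic_catalan = tokenizer.detokenize_batch([t.hypotheses[0] for t in translations])
print("Synthetic Catalan:", synthetic_catalan)
```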
@@ -290,7 +290,7 @@ The purpose of creating this dataset is to pre-train multilingual models on para
 
 The dataset has been created by the Machine Translation sub-group of the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
 Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
-and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca Fornaciari.
+and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
 
 However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
 and public institutions, which can be found in detail in the acknowledgements.
@@ -305,8 +305,8 @@ This work/research has been promoted and financed by the Government of Catalonia
 
 The dataset consists entirely of parallel text separated at sentence level. Specifically, data was mainly sourced from the following databases and
 repositories:
-- **Opus:** Repository which aims to provide freely available parallel datasets in order to advance work in computational linguistics and automatic translation.
-- **ELRC:** Repository used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination
+- **[Opus](https://opus.nlpl.eu/):** Repository which aims to provide freely available parallel datasets in order to advance work in computational linguistics and automatic translation.
+- **[ELRC-SHARE](https://www.elrc-share.eu/):** Repository used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination.
 
 **How many instances are there in total (of each type, if appropriate)?**
 
@@ -409,8 +409,8 @@ ethical and legal point of view, respectively.
 **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
 All data was filtered according to two specific criteria:
-- Alignment - sentence level alignments were calculated using LaBSE and sentence pairs with a score below 0.75 were discarded.
-- Language identification - The probability of being the target language was calculated using either
+- Alignment - sentence level alignments were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and sentence pairs with a score below 0.75 were discarded.
+- Language identification - The probability of being the target language was calculated using either [Idiomata Cognitor](https://github.com/transducens/idiomata_cognitor) or [Lingua.py](https://github.com/pemistahl/lingua-py) and sentences identified as unlikely to be the correct language were filtered out. Thresholds varied by language.
 
 **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
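To make the two filtering criteria concrete, the sketch below combines a LaBSE similarity cut-off with a Lingua.py confidence check. Only the 0.75 alignment threshold comes from the text above; the 0.5 language-confidence threshold, the language set, and the choice of Lingua.py over Idiomata Cognitor are illustrative assumptions (the card itself notes that thresholds varied by language).

```python
# Hedged sketch of the two filters described above. The 0.75 alignment cut-off
# is stated in the card; the 0.5 language-confidence threshold is a placeholder,
# since thresholds varied by language.
from lingua import Language, LanguageDetectorBuilder
from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("sentence-transformers/LaBSE")
detector = LanguageDetectorBuilder.from_languages(
    Language.SPANISH, Language.CATALAN, Language.ENGLISH
).build()

def keep_pair(src, tgt, tgt_lang=Language.CATALAN, align_threshold=0.75, lang_threshold=0.5):
    # Alignment: dot product of L2-normalised LaBSE embeddings, i.e. cosine similarity.
    src_emb, tgt_emb = labse.encode([src, tgt], normalize_embeddings=True)
    if float(src_emb @ tgt_emb) < align_threshold:
        return False
    # Language identification: confidence that the target side is the expected language.
    return detector.compute_language_confidence(tgt, tgt_lang) >= lang_threshold

pairs = [("La casa es azul.", "La casa és blava."), ("Buenos días.", "Hello there.")]
print([p for p in pairs if keep_pair(*p)])  # keeps only the aligned Spanish-Catalan pair
```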