fdelucaf committed (verified)
Commit f3e276c · Parent(s): 6e38bf4

Add missing links

Files changed (1): README.md (+6 -6)
README.md CHANGED
@@ -232,7 +232,7 @@ print("Generated Translations:", results_detokenized)
 The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, including all of the official European languages plus Catalan, Basque,
 Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
 
- This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora.
+ This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/) and Project Aina’s existing corpora.
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
 
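As a rough illustration of the synthetic-data step described in the hunk above, the sketch below shows how the Spanish side of a collected Spanish <-> xx corpus could be swapped for machine-translated Catalan to yield synthetic Catalan <-> xx pairs. The `translate_es_to_ca` callable is a hypothetical stand-in for Projecte Aina’s Spanish-Catalan model linked above; its actual loading and inference code is not reproduced here.

```python
# Minimal sketch of the pivoting step: (es, xx) pairs -> synthetic (ca, xx) pairs.
# `translate_es_to_ca` is a hypothetical stand-in for the Spanish-Catalan MT model.
from typing import Callable, Iterable, List, Tuple


def pivot_to_catalan(
    es_xx_pairs: Iterable[Tuple[str, str]],
    translate_es_to_ca: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[Tuple[str, str]]:
    """Replace the Spanish side of each pair with its Catalan machine translation."""
    pairs = list(es_xx_pairs)
    synthetic: List[Tuple[str, str]] = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        ca_side = translate_es_to_ca([es for es, _ in batch])    # batched MT call
        synthetic.extend(zip(ca_side, (xx for _, xx in batch)))  # keep the xx side as-is
    return synthetic
```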
@@ -290,7 +290,7 @@ The purpose of creating this dataset is to pre-train multilingual models on para
 
 The dataset has been created by the Machine Translation sub-group of the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
 Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
- and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca Fornaciari.
+ and the use of HPC. In particular, the main contributors were Audrey Mash and Francesca De Luca Fornaciari.
 
 However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
 and public institutions, which can be found in detail in the acknowledgements.
@@ -305,8 +305,8 @@ This work/research has been promoted and financed by the Government of Catalonia
 
 The dataset consists entirely of parallel text separated at sentence level. Specifically, data was mainly sourced from the following databases and
 repositories:
- - **Opus:** Repository which aims to provide freely available parallel datasets in order to advance work in computational linguistics and automatic translation.
- - **ELRC:** Repository used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination,
+ - **[Opus](https://opus.nlpl.eu/):** Repository which aims to provide freely available parallel datasets in order to advance work in computational linguistics and automatic translation.
+ - **[ELRC-SHARE](https://www.elrc-share.eu/):** Repository used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination.
 
 **How many instances are there in total (of each type, if appropriate)?**
 
@@ -409,8 +409,8 @@ ethical and legal point of view, respectively.
 **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
 All data was filtered according to two specific criteria:
- - Alignment - sentence level alignments were calculated using LaBSE and sentence pairs with a score below 0.75 were discarded.
- - Language identification - The probability of being the target language was calculated using either IdiomaCognitor or Lingua.py and sentences identified as unlikely to be the correct language were filtered out. Thresholds varied by language.
+ - Alignment - sentence level alignments were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and sentence pairs with a score below 0.75 were discarded.
+ - Language identification - The probability of being the target language was calculated using either [Idiomata Cognitor](https://github.com/transducens/idiomata_cognitor) or [Lingua.py](https://github.com/pemistahl/lingua-py) and sentences identified as unlikely to be the correct language were filtered out. Thresholds varied by language.
 
 **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
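As a rough illustration of the two filters listed in the hunk above, the sketch below scores a sentence pair with the sentence-transformers LaBSE checkpoint and checks the target side with Lingua.py. The 0.75 alignment threshold comes from the text; the 0.9 language-confidence threshold is an illustrative placeholder (the README notes thresholds varied by language), and the Idiomata Cognitor path is not shown.

```python
# Minimal sketch of the two filters, assuming sentence-transformers and lingua-py.
# The 0.9 language-confidence threshold is illustrative only.
from lingua import Language, LanguageDetectorBuilder
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")
detector = LanguageDetectorBuilder.from_languages(Language.CATALAN, Language.SPANISH).build()


def keep_pair(
    src: str,
    tgt: str,
    tgt_lang: Language = Language.CATALAN,
    align_threshold: float = 0.75,
    lang_threshold: float = 0.9,
) -> bool:
    # Alignment filter: cosine similarity between LaBSE embeddings of both sides.
    src_emb, tgt_emb = labse.encode([src, tgt], convert_to_tensor=True)
    if util.cos_sim(src_emb, tgt_emb).item() < align_threshold:
        return False
    # Language-ID filter: confidence that the target side is in the expected language.
    return detector.compute_language_confidence(tgt, tgt_lang) >= lang_threshold
```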