AudreyVM committed (verified)
Commit 80a2ca8 · Parent(s): 0710178

Update README.md

Files changed (1):
  1. README.md +14 -25
README.md CHANGED
@@ -233,7 +233,7 @@ The training corpus consists of 70 billion tokens of Catalan- and Spanish-centri
 
 This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Project Aina’s existing corpora. Where little parallel Catalan <-> data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using Project Aina’s es-> ca model. (link and correct name). The final distribution of languages was as below:
 
-And they you add them as a link without any text. For example: ![](./images/treemap.png)
+![](./images/treemap.png)
 
 Click the expand button below to see the full list of corpora included in the training data.
 
@@ -269,28 +269,6 @@ Click the expand button below to see the full list of corpora included in the tr
 
 
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-<details>
-<summary>References</summary>
-
-- to be added
-
-
-</details>
-
 </details>
 
 
@@ -303,7 +281,7 @@ We provide an extense Datasheet section following the best practices defined by
 
 **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
-The purpose of creating this dataset is to pre-train multilingual models on parallel data in a large number of European languages, with Spanish and Catalan as the pivot languages. We have found that there is a lack of high quality parallel data in the scale necessary for training models, particularly between mid to low resource languages, and so in this dataset we have attempted to compile all publicly available resources for the included smaller languages.
+The purpose of creating this dataset is to pre-train multilingual models on parallel data in a large number of European languages, with Spanish and Catalan as the pivot languages. We have found that there is a lack of high quality parallel data in the scale necessary for training models, particularly between mid to low resource languages, and so in this dataset we have attempted to compile all publicly available resources for the included smaller languages, in addition to creating additional resources for Catalan as the pivot language.
 
 **Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**
 
@@ -333,7 +311,10 @@ The dataset contains a diverse range of sentence pairs across multiple languages
 
 **Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**
 
-The dataset is a sample from various sources. Language pairs which had fewer than 100 million parallel sentence pairs after filtering and cleaning were taken in their entirety. A sample of 100 million sentence pairs was taken from language pairs which had more data than this after preprocessing. All sampling was random. Where very little data existed between Catalan and the target language, synthetic Catalan data was created in order to increase the sample size. This was done using Projecte Aina’s xxx model.
+The dataset is a sample from various sources. Language pairs which had fewer than 100 million parallel sentence pairs after filtering and cleaning were taken
+in their entirety. A sample of 100 million sentence pairs was taken from language pairs which had more data than this after preprocessing. All sampling was random.
+Where very little data existed between Catalan and the target language, synthetic Catalan data was created in order to increase the sample size.
+This was done using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
 
 **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
@@ -498,6 +479,14 @@ The dataset does not allow for external contributions.
 
 </details>
 
+<details>
+<summary>References</summary>
+
+- to be added
+
+
+</details>
+
 
 
 ## Evaluation
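
As an editorial note on the sampling procedure described in the updated datasheet text (keep a language pair in full when it has fewer than 100 million parallel sentence pairs after cleaning, otherwise draw a random sample of 100 million pairs), here is a minimal sketch of one way to implement that cap with single-pass reservoir sampling. The function name, the iterator interface, and the choice of reservoir sampling are illustrative assumptions, not the pipeline the team actually used.

```python
import random
from typing import Iterable, List, Tuple

def sample_language_pair(pairs: Iterable[Tuple[str, str]],
                         cap: int = 100_000_000,
                         seed: int = 0) -> List[Tuple[str, str]]:
    """Keep every sentence pair if the corpus has fewer than `cap` pairs;
    otherwise return a uniform random sample of `cap` pairs.
    Single-pass reservoir sampling (Algorithm R), so the corpus never
    needs to fit in memory beyond the reservoir itself."""
    rng = random.Random(seed)
    reservoir: List[Tuple[str, str]] = []
    for i, pair in enumerate(pairs):
        if i < cap:
            reservoir.append(pair)      # still below the cap: keep everything
        else:
            j = rng.randint(0, i)       # keep new pair with probability cap/(i+1)
            if j < cap:
                reservoir[j] = pair
    return reservoir
```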
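The diff also links the Spanish-to-Catalan model used to back-fill synthetic Catalan from the Spanish side of the collected es <-> xx corpora. The sketch below shows roughly what that step looks like; it assumes the linked checkpoint (or any es->ca translation model) can be loaded through the standard transformers translation pipeline, which may not match the exact tooling actually used, and `backfill_catalan` is a hypothetical helper name.

```python
from transformers import pipeline

# Assumption: this checkpoint is compatible with the transformers translation
# pipeline; substitute whichever es->ca model you actually use.
MODEL_ID = "projecte-aina/aina-translator-es-ca"

def backfill_catalan(spanish_sentences, batch_size=32):
    """Translate the Spanish side of an es<->xx corpus into Catalan,
    producing synthetic ca<->xx sentence pairs."""
    translator = pipeline("translation", model=MODEL_ID)
    outputs = translator(spanish_sentences, batch_size=batch_size)
    return [out["translation_text"] for out in outputs]
```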