Update README.md
The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric …
This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU project and Projecte Aina's existing corpora. Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using [Projecte Aina's Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages is shown below:

![](./images/treemap.png)

Click the expand button below to see the full list of corpora included in the training data.

</details>
We provide an extensive Datasheet section following the best practices defined by …
**For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
The purpose of creating this dataset is to pre-train multilingual models on parallel data in a large number of European languages, with Spanish and Catalan as the pivot languages. We found a lack of high-quality parallel data at the scale necessary for training models, particularly between mid- to low-resource languages, and so in this dataset we have attempted to compile all publicly available resources for the included smaller languages, in addition to creating additional resources for Catalan as the pivot language.
**Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**

The dataset contains a diverse range of sentence pairs across multiple languages …

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**
The dataset is a sample from various sources. Language pairs which had fewer than 100 million parallel sentence pairs after filtering and cleaning were taken in their entirety. A sample of 100 million sentence pairs was taken from language pairs which had more data than this after preprocessing. All sampling was random. Where very little data existed between Catalan and the target language, synthetic Catalan data was created in order to increase the sample size. This was done using [Projecte Aina's Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
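
The capped sampling described above can be sketched as follows. This is an illustrative reservoir-sampling sketch under our own assumptions, not the project's actual tooling; the function name and the 100-million cap parameter are ours:

```python
import random

def sample_language_pair(sentence_pairs, cap, seed=0):
    """Keep a language pair's corpus in its entirety if it has at most
    `cap` sentence pairs; otherwise draw a uniform random sample of
    `cap` pairs, streaming so the full corpus never needs to fit in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(sentence_pairs):
        if i < cap:
            reservoir.append(pair)
        else:
            # Algorithm R: each seen pair ends up in the sample with
            # equal probability cap / (i + 1).
            j = rng.randrange(i + 1)
            if j < cap:
                reservoir[j] = pair
    return reservoir

# A small corpus is kept whole; a large one is capped.
small = [("frase es", "sentence xx")] * 50
large = ((f"es {i}", f"xx {i}") for i in range(1_000))
assert len(sample_language_pair(small, cap=100)) == 50
assert len(sample_language_pair(large, cap=100)) == 100
```

Reservoir sampling is one way to realise "all sampling was random" over corpora too large to shuffle in memory; any uniform sampler gives the same distribution.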
**What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
The dataset does not allow for external contributions.
</details>

<details>
<summary>References</summary>

- to be added

</details>

## Evaluation