---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
  - bleu
widget:
 - text: Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020.
model-index:
- name: mt-general-cy-en
  results:
  - task:
      name: Translation
      type: translation
    metrics:
      - type: bleu
        value: 54
---
# mt-general-cy-en
A general language translation model for translating between Welsh and English.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:
 - [UK Government Legislation data](https://www.legislation.gov.uk)
 - [OPUS-cy-en](https://opus.nlpl.eu/)
 - [Cofnod Y Cynulliad](https://record.assembly.wales/)
 - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets, with the test set comprising a random 20% slice of the total dataset. Segments were selected randomly from the plain text and TMX files of the datasets described above.
The datasets were cleaned, without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
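As a rough illustration of that splitting step only, the sketch below partitions a list of aligned segments into an 80/20 train/test split and then divides the training portion into 10 chunks. The segment list and variable names are placeholders for illustration, not part of the actual DVC pipeline.

```python
import random

# Hypothetical list of aligned (cy, en) segment pairs gathered from the
# sources listed above; the real loading step is not shown here.
segments = [("Bore da", "Good morning"), ("Nos da", "Good night")] * 500

random.seed(42)
random.shuffle(segments)

# Hold out a random 20% slice as the test set, as described above.
test_size = len(segments) // 5
test_set = segments[:test_size]
train_val = segments[test_size:]

# Split the remainder into 10 chunks, one per Marian NMT training process.
n_splits = 10
chunk = len(train_val) // n_splits
folds = [train_val[i * chunk:(i + 1) * chunk] for i in range(n_splits)]

print(len(test_set), [len(f) for f in folds])
```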

## Evaluation

The BLEU evaluation score was produced using the python library [SacreBLEU](https://github.com/mjpost/sacrebleu).
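For reference, a minimal SacreBLEU invocation looks like the sketch below (assuming `pip install sacrebleu`); the example sentences are placeholders, not the held-out test set used for the reported score.

```python
import sacrebleu

# Placeholder system outputs and reference translations.
hypotheses = ["The Welsh Government has a target of a million Welsh speakers."]
references = [["The Welsh Government has a target of reaching a million Welsh speakers."]]

# corpus_bleu takes the system outputs and a list of reference sets.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```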
## Usage

Ensure you have the prerequisite python libraries installed:

```bash
pip install transformers sentencepiece
```

```python
import transformers

model_id = "mgrbyte/mt-general-cy-en"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Build a translation pipeline from the loaded model and tokenizer.
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)

# The pipeline returns a list of dicts, one per input sentence.
translated = translate(
    "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020."
)
print(translated[0]["translation_text"])
```