---
language:
- cy
- en
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
widget:
- text: Mae gan Lywodraeth Cymru darged i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020.
model-index:
- name: mt-general-cy-en
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - type: bleu
      value: 54
---
# mt-general-cy-en
A general-purpose translation model for translating between Welsh and English.
This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:
- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)
The data was split into train, validation and test sets, the test set comprising a random 20% slice of the total dataset. Segments were selected at random from the text and TMX files in the datasets described above.
The datasets were cleaned, without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been split into 10 training and validation sets.
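The splitting scheme described above can be sketched as follows. This is a minimal illustration only, assuming the aligned segments are held in a Python list; the function name, shard layout and 10% validation reservation are assumptions for the sketch, not the published DVC pipeline stages:

```python
import random

def split_segments(segments, test_fraction=0.2, n_shards=10, seed=42):
    """Shuffle aligned segments, hold out a random test slice, and shard
    the remainder into per-process train/validation sets."""
    rng = random.Random(seed)
    shuffled = segments[:]
    rng.shuffle(shuffled)

    # Reserve a random slice (20% by default) as the held-out test set.
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    rest = shuffled[n_test:]

    # Distribute the remaining segments round-robin over n_shards training
    # processes; each shard keeps its last 10% as a validation set.
    shards = []
    for i in range(n_shards):
        shard = rest[i::n_shards]
        n_val = max(1, len(shard) // 10)
        shards.append({"train": shard[:-n_val], "valid": shard[-n_val:]})
    return test, shards
```

Seeding the shuffle keeps the split reproducible across pipeline runs.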
## Evaluation
The BLEU evaluation score was produced using the python library [SacreBLEU](https://github.com/mjpost/sacrebleu).
## Usage
Ensure you have the prerequisite Python libraries installed:
```bash
pip install transformers sentencepiece
```
```python
import transformers
model_id = "mgrbyte/mt-general-cy-en"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "Mae gan Lywodraeth Cymru darged i gyrraedd miliwn o siaradwyr Cymraeg erbyn y flwyddyn 2020."
)
# The pipeline returns a list of dicts, one per input sentence.
print(translated[0]["translation_text"])
```