mgrbyte committed on
Commit
6049163
·
1 Parent(s): 3b6c80d

Initial version of model card

Files changed (1)
  1. README.md +57 -0
README.md CHANGED
@@ -1,3 +1,60 @@
  ---
+ language:
+ - cy
+ - en
  license: apache-2.0
+ pipeline_tag: translation
+ tags:
+ - translation
+ - marian
+ metrics:
+ - bleu
+ widget:
+ - text: "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwr Cymraeg erbyn y flwyddyn 2020."
+ model-index:
+ - name: mt-dspec-legislation-en-cy
+   results:
+   - task:
+       name: Translation
+       type: translation
+     metrics:
+     - type: bleu
+       value: 54
  ---
+ # mt-dspec-legislation-en-cy
+ A machine translation model for translating between English and Welsh, specialised for the legislation domain.
+
+ This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
+ The datasets were prepared from the following sources:
+ - [UK Government Legislation data](https://www.legislation.gov.uk)
+ - [OPUS-cy-en](https://opus.nlpl.eu/)
+ - [Cofnod Y Cynulliad](https://record.assembly.wales/)
+ - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)
+
+ The data was split into train, validation and test sets, with the test set comprising a random 20% slice of the total dataset. Segments were selected randomly from
+ the text and TMX content of the datasets described above.
+ The datasets were cleaned without any pre-tokenisation, using a SentencePiece vocabulary model, and then fed into 10 separate Marian NMT training processes, the data having been
+ split into 10 training and validation sets.
+
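The split described in the card could be reproduced roughly as follows. This is a minimal sketch, not the author's DVC pipeline: the function names `split_test` and `make_folds`, the fixed seed, the placeholder segment pairs, and the round-robin assignment of segments to the 10 training/validation sets are all assumptions made for illustration.

```python
import random


def split_test(segments, test_fraction=0.2, seed=0):
    """Shuffle the segment pairs and hold out a random fraction as the test set."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Returns (remaining, test); the first n_test shuffled items form the test set.
    return shuffled[n_test:], shuffled[:n_test]


def make_folds(segments, n_folds=10):
    """Yield one (train, validation) pair per fold, assigning segments round-robin."""
    for k in range(n_folds):
        validation = segments[k::n_folds]
        train = [s for i, s in enumerate(segments) if i % n_folds != k]
        yield train, validation


# Placeholder parallel segments (hypothetical data, not from the real corpora).
pairs = [(f"source segment {i}", f"target segment {i}") for i in range(100)]

remaining, test = split_test(pairs)
folds = list(make_folds(remaining))

print(len(test), len(remaining), len(folds))  # 20 80 10
```

Each `(train, validation)` pair would then feed one of the 10 training processes. Whether the original sets overlapped or were disjoint is not stated in the card, so the fold layout here is only one plausible reading.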
+ ## Evaluation
+
+ The BLEU evaluation score was produced using the Python library [SacreBLEU](https://github.com/mjpost/sacrebleu).
+ ## Usage
+
+ Ensure you have the prerequisite Python libraries installed:
+
+ ```bash
+ pip install transformers sentencepiece
+ ```
+
+ ```python
+ import transformers
+ model_id = "mgrbyte/mt-general-cy-en"
+ tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+ model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
+ translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
+ translated = translate(
+     "Mae gan Lywodraeth Cymru targed i gyrraedd miliwn o siaradwr Cymraeg erbyn y flwyddyn 2020."
+ )
+ print(translated[0]["translation_text"])  # the pipeline returns a list with one dict per input
+ ```