peterhung committed on
Commit
fdebe5b
·
verified ·
1 Parent(s): cfe60a2

Update README.md

Add more instructions

Files changed (1)
  1. README.md +105 -9
README.md CHANGED
@@ -1,31 +1,127 @@
  ---
- license: afl-3.0
  language:
  - vi
  pipeline_tag: token-classification
  tags:
  - vietnamese
  - accents inserter
  ---

  # A Transformer model for inserting Vietnamese accent marks

  This model is fine-tuned from XLM-RoBERTa Large.

- Example input: Toi di hoc.
- Target output: Tôi đi học.

  ## Model training
  This problem was modelled as a token classification problem. For each input token, the goal is to assign a "tag" that will transform it
  to the accented token.
- For more details on the training process, please refer to this [blog post](https://peterhung.org/tech/insert-vietnamese-accent-transformer-model/).

  ## How to use this model
- There are 2 main steps:
- - Load the model as a token classification model (*AutoModelForTokenClassification*).
- - Run the input through the model to obtain the tag index for each input token.
- - Use the tags' index to retrieve the actual tags in the file *selected_tags_names.txt*.
- - Apply the transformation to each token to obtain accented tokens.

  ---
+ license: mit
  language:
  - vi
  pipeline_tag: token-classification
  tags:
  - vietnamese
  - accents inserter
+ metrics:
+ - accuracy
  ---

  # A Transformer model for inserting Vietnamese accent marks

  This model is fine-tuned from XLM-RoBERTa Large.

+ Example input: Nhin nhung mua thu di
+ Target output: Nhìn những mùa thu đi

  ## Model training
  This problem was modelled as a token classification problem. For each input token, the goal is to assign a "tag" that will transform it
  to the accented token.
+
+ For more details on the training process, please refer to this
+ <a href="https://peterhung.org/tech/insert-vietnamese-accent-transformer-model/" target="_blank">blog post</a>.
+

  ## How to use this model
+ There are just a few steps:
+ - Step 1: Load the model as a token classification model (*AutoModelForTokenClassification*).
+ - Step 2: Run the input through the model to obtain the tag index for each input token.
+ - Step 3: Use the tag indices to retrieve the actual tags from the file *selected_tags_names.txt*. Then,
+ apply the conversion indicated by each tag to its token to obtain the accented tokens.
+
+ ### Step 1: Load model
+ Note: Install the *transformers*, *torch*, and *numpy* packages first.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ import torch
+ import numpy as np
+
+ def load_trained_transformer_model():
+     # model repository on the Hugging Face Hub
+     model_path = "peterhung/transformer-vnaccent-marker"
+     tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
+     model = AutoModelForTokenClassification.from_pretrained(model_path)
+     return model, tokenizer
+
+ model, tokenizer = load_trained_transformer_model()
+ ```
+
+ ### Step 2: Run input text through the model
+
+ ```python
+ # use the GPU if one is available, otherwise fall back to the CPU
+ device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+ model.to(device)
+
+ # set to eval mode
+ model.eval()
+
+ def insert_accents(text, model, tokenizer):
+     our_tokens = text.strip().split()
+
+     # the tokenizer may further split our tokens
+     inputs = tokenizer(our_tokens,
+                        is_split_into_words=True,
+                        truncation=True,
+                        padding=True,
+                        return_tensors="pt"
+                        )
+     input_ids = inputs['input_ids']
+     tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
+     # drop the special tokens '<s>' and '</s>'
+     tokens = tokens[1:-1]
+
+     with torch.no_grad():
+         inputs = inputs.to(device)
+         outputs = model(**inputs)
+
+     predictions = outputs["logits"].cpu().numpy()
+     predictions = np.argmax(predictions, axis=2)
+
+     # exclude output at index 0 and the last index, which correspond to '<s>' and '</s>'
+     predictions = predictions[0][1:-1]
+
+     assert len(tokens) == len(predictions)
+
+     return tokens, predictions
+
+
+ text = "Nhin nhung mua thu di, em nghe sau len trong nang."
+
+ tokens, predictions = insert_accents(text, model, tokenizer)
+ ```
+
+ ### Step 3: Obtain the accented words
+
+ 3.1 Download the tag set file from this repo, then load it:
+ ```python
+ def _load_tags_set(fpath):
+     labels = []
+     # the tags contain Vietnamese characters; this assumes the file is UTF-8 encoded
+     with open(fpath, 'r', encoding='utf-8') as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 labels.append(line)
+
+     return labels
+
+ label_list = _load_tags_set("/content/training_data/vnaccent/corpus-title.train.selected_tags_names.txt")
+ assert len(label_list) == 528, f"Expected 528 tags, got {len(label_list)}"
+ ```

+ 3.2 Print out `tokens` and `predictions` obtained above to inspect the model output:
+ ```python
+ print(tokens)
+ print(list(f"{pred} ({label_list[pred]})" for pred in predictions))
+ ```
+ Output:
+ ```python
+ ['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em', '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.']
+ ['217 (i-ì)', '217 (i-ì)', '388 (u-ữ)', '388 (u-ữ)', '407 (ua-ùa)', '378 (u-u)', '120 (di-đi)', '0 (-)', '185 (e-e)', '185 (e-e)', '41 (au-âu)', '188 (e-ê)', '302 (o-o)', '14 (a-ắ)', '0 (-)']
+ ```
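
Applying the tags to the tokens could look like the sketch below. It is not taken from the model's training code and rests on two assumptions: each tag has the form `raw-accented` (as the printed tags suggest, with `-` meaning "no change"), and a word takes the tag predicted for its first sub-token. The helper names `merge_subwords` and `apply_tag` are hypothetical. The tokens and tags are hard-coded from the sample output so the snippet runs without the model.

```python
def merge_subwords(tokens, tags):
    """Join SentencePiece pieces into words; keep the tag of each word's first piece (assumption)."""
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token.startswith("▁") or not words:
            words.append(token.lstrip("▁"))
            word_tags.append(tag)
        else:
            # continuation piece (or punctuation): glue it onto the previous word
            words[-1] += token
    return words, word_tags


def apply_tag(word, tag):
    """Replace the first occurrence of the raw part of a 'raw-accented' tag (assumed format)."""
    raw, sep, accented = tag.partition("-")
    if not sep or not raw or raw == accented:
        return word  # no-op tags such as '-' or 'u-u'
    idx = word.lower().find(raw)
    if idx < 0:
        return word  # the tag does not apply to this word
    original = word[idx:idx + len(raw)]
    if len(original) == len(accented):
        # carry the original casing over, character by character
        replaced = "".join(a.upper() if o.isupper() else a
                           for o, a in zip(original, accented))
    else:
        replaced = accented
    return word[:idx] + replaced + word[idx + len(raw):]


# tokens and tags copied from the sample output
tokens = ['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em',
          '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.']
tags = ['i-ì', 'i-ì', 'u-ữ', 'u-ữ', 'ua-ùa', 'u-u', 'di-đi', '-', 'e-e',
        'e-e', 'au-âu', 'e-ê', 'o-o', 'a-ắ', '-']

words, word_tags = merge_subwords(tokens, tags)
accented = " ".join(apply_tag(w, t) for w, t in zip(words, word_tags))
print(accented)  # Nhìn những mùa thu đi, em nghe sâu lên trong nắng.
```

With real model output, the tag strings would come from `label_list[pred]` for each entry in `predictions`. Taking the first sub-token's tag is the simplest choice; a majority vote over a word's sub-tokens would be an alternative.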


+ ## Limitations
+ - This model accepts a maximum of 512 tokens, a limitation inherited from the base pretrained XLM-RoBERTa model.
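
One way to stay under the token limit is to split long text into smaller pieces before running the model. The sketch below is only an illustration: `chunk_words` is a hypothetical helper, and the default of 200 words per chunk is an assumed, conservative budget, since a single word may be split into several sub-tokens by the tokenizer.

```python
def chunk_words(text, max_words=200):
    """Split text into chunks of at most max_words whitespace-separated words.

    max_words is an assumed budget, not a value from the model card.
    """
    words = text.strip().split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


chunks = chunk_words("Nhin nhung mua thu di, em nghe sau len trong nang.", max_words=4)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be passed separately to `insert_accents`. Because the chunks are cut at fixed word counts rather than at sentence boundaries, words near a cut see less context, which may affect the predicted accents there.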