ai-forever committed on
Commit
dee8b98
·
verified ·
1 Parent(s): 1a9131c

Update README.md

  1. README.md +203 -0
README.md CHANGED
@@ -1,3 +1,206 @@
1
  ---
2
+ language:
3
+ - ru
4
+ tags:
5
+ - spellchecking
6
+ - M2M100
7
+ - pytorch
8
+ - natural language generation
9
  license: mit
10
+ datasets:
11
+ - ai-forever/spellcheck_benchmark
12
+ metrics:
13
+ - precision
14
+ - recall
15
+ - f1
16
+ library_name: transformers
17
+ model-index:
18
+ - name: sage-m2m100-1.2B
19
+   results:
20
+   - task:
21
+       type: text-generation
22
+     dataset:
23
+       type: spellcheck_benchmark
24
+       name: RUSpellRU
25
+     metrics:
26
+     - name: Precision
27
+       type: precision
28
+       value: 88.8
29
+       verified: false
30
+     - name: Recall
31
+       type: recall
32
+       value: 71.5
33
+       verified: false
34
+     - name: F1
35
+       type: f1
36
+       value: 79.2
37
+       verified: false
38
+   - task:
39
+       type: text-generation
40
+     dataset:
41
+       type: spellcheck_benchmark
42
+       name: MultidomainGold
43
+     metrics:
44
+     - name: Precision
45
+       type: precision
46
+       value: 63.8
47
+       verified: false
48
+     - name: Recall
49
+       type: recall
50
+       value: 61.1
51
+       verified: false
52
+     - name: F1
53
+       type: f1
54
+       value: 62.4
55
+       verified: false
56
+   - task:
57
+       type: text-generation
58
+     dataset:
59
+       type: spellcheck_benchmark
60
+       name: MedSpellchecker
61
+     metrics:
62
+     - name: Precision
63
+       type: precision
64
+       value: 78.8
65
+       verified: false
66
+     - name: Recall
67
+       type: recall
68
+       value: 71.4
69
+       verified: false
70
+     - name: F1
71
+       type: f1
72
+       value: 74.9
73
+       verified: false
74
+   - task:
75
+       type: text-generation
76
+     dataset:
77
+       type: spellcheck_benchmark
78
+       name: GitHubTypoCorpusRu
79
+     metrics:
80
+     - name: Precision
81
+       type: precision
82
+       value: 47.1
83
+       verified: false
84
+     - name: Recall
85
+       type: recall
86
+       value: 42.9
87
+       verified: false
88
+     - name: F1
89
+       type: f1
90
+       value: 44.9
91
+       verified: false
92
  ---
93
+ # sage-m2m100-1.2B model
94
+
95
+ ## Summary
96
+
97
+ The model corrects spelling errors and typos by bringing every word in the text in line with standard Russian spelling.
98
+ The corrector is based on the [M2M100-1.2B](https://huggingface.co/facebook/m2m100_1.2B) model.
99
+ The training corpus is a large dataset with synthetic errors: texts from Russian-language Wikipedia and transcripts of Russian-language videos, into which typos and spelling errors were introduced automatically with the [SAGE](https://github.com/ai-forever/sage) library.
100
+ The model is a fine-tuned version of the [pre-trained model](https://huggingface.co/ai-forever/RuM2M100-1.2B).
101
+
102
+ ## Public references
103
+ - [SAGE library announcement](https://youtu.be/yFfkV0Qjuu0), DataFest 2023
104
+ - [Paper about synthetic error generation methods](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf), Dialogue 2023
105
+ - [SAGE EACL 2024 paper](https://aclanthology.org/2024.findings-eacl.10/)
106
+
107
+
108
+ ## Examples
109
+ | Input | Output |
110
+ | --- | --- |
111
+ | Думю ешцъа лет череа 10 ретроспективно просматривотьэ то будкетцц мне невероя тна ин те р но | Думаю что лет через 10 ретроспективно просматривать это будет мне невероятно интересно |
112
+ | Основая цель мероприятия - практическая отработка навыков по оказанию помощи гражданам, попавшим в ДТП, а также повышение и совершенствование уровня профессиональной подготовки сотрудников МЧС при проведении аварийно-спасательных работ по ликвидации последствий дорожно-транспортных проишествий, сокращение временных показателей реагирования. | Основная цель мероприятия - практическая отработка навыков по оказанию помощи гражданам, попавшим в ДТП, а также повышение и совершенствование уровня профессиональной подготовки сотрудников МЧС при проведении аварийно-спасательных работ по ликвидации последствий дорожно-транспортных происшествий, сокращение временных показателей реагирования. |
113
+ | прийдя в МГТУ я был удивлен никого необноружив там… | придя в МГТУ я был удивлен никого не обнаружив там |
115
+
116
+ ## Metrics
117
+ ### Quality
118
+ The tables below report automatic metrics that measure the correctness of spell checkers.
119
+ We compare our solution with open automatic spell checkers and with the ChatGPT family of models on all four available datasets:
120
+ - **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;
121
+ - **MultidomainGold**: examples from 7 text sources: the open web, news, social media, reviews, subtitles, policy documents, and literary works;
122
+ - **MedSpellChecker**: texts with errors from medical anamneses (patient histories);
123
+ - **GitHubTypoCorpusRu**: spelling errors and typos in commits from [GitHub](https://github.com).
124
+
125
+ **RUSpellRU**
126
+ | Model | Precision | Recall | F1 |
127
+ | --- | --- | --- | --- |
128
+ | sage-m2m100-1.2B | 88.8 | 71.5 | 79.2 |
129
+ | sage-ai-service | 93.5 | 82.4 | 87.6 |
130
+ | gpt-3.5-turbo | 39.6 | 62.3 | 48.5 |
131
+ | gpt-4 | 69.5 | 81.0 | 74.8 |
132
+ | Yandex.Speller | 83.0 | 59.8 | 69.5 |
133
+ | JamSpell | 42.1 | 32.8 | 36.9 |
134
+ | HunSpell | 31.3 | 34.9 | 33.0 |
135
+
136
+ **MultidomainGold**
137
+ | Model | Precision | Recall | F1 |
138
+ | --- | --- | --- | --- |
139
+ | sage-m2m100-1.2B | 63.8 | 61.1 | 62.4 |
140
+ | sage-ai-service | 70.9 | 68.8 | 69.9 |
141
+ | gpt-3.5-turbo | 17.8 | 56.1 | 27.0 |
142
+ | gpt-4 | 31.1 | 78.1 | 44.5 |
143
+ | Yandex.Speller | 52.9 | 51.4 | 52.2 |
144
+ | JamSpell | 25.7 | 30.6 | 28.0 |
145
+ | HunSpell | 16.2 | 40.1 | 23.0 |
146
+
147
+ **MedSpellChecker**
148
+ | Model | Precision | Recall | F1 |
149
+ | --- | --- | --- | --- |
150
+ | sage-m2m100-1.2B | 78.8 | 71.4 | 74.9 |
151
+ | sage-ai-service | 73.4 | 76.2 | 74.9 |
152
+ | gpt-3.5-turbo | 15.1 | 53.6 | 23.5 |
153
+ | gpt-4 | 48.9 | 88.7 | 63.1 |
154
+ | Yandex.Speller | 80.6 | 47.8 | 60.0 |
155
+ | JamSpell | 24.6 | 29.7 | 26.9 |
156
+ | HunSpell | 10.3 | 40.2 | 16.4 |
157
+
158
+ **GitHubTypoCorpusRu**
159
+ | Model | Precision | Recall | F1 |
160
+ | --- | --- | --- | --- |
161
+ | sage-m2m100-1.2B | 47.1 | 42.9 | 44.9 |
162
+ | sage-ai-service | 76.1 | 51.2 | 61.2 |
163
+ | gpt-3.5-turbo | 23.7 | 43.9 | 30.8 |
164
+ | gpt-4 | 34.7 | 60.5 | 44.1 |
165
+ | Yandex.Speller | 67.7 | 37.5 | 48.3 |
166
+ | JamSpell | 49.5 | 29.9 | 37.3 |
167
+ | HunSpell | 28.5 | 30.7 | 29.6 |
168
+
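The F1 values in the tables above are consistent with the usual harmonic mean of precision and recall. The sketch below assumes that definition (the source does not state the formula) and recomputes F1 for the sage-m2m100-1.2B rows. `f1_score` is an illustrative helper, not part of the SAGE library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for sage-m2m100-1.2B, copied from the tables above.
sage_m2m100_rows = {
    "RUSpellRU": (88.8, 71.5),
    "MultidomainGold": (63.8, 61.1),
    "MedSpellChecker": (78.8, 71.4),
    "GitHubTypoCorpusRu": (47.1, 42.9),
}

for dataset, (precision, recall) in sage_m2m100_rows.items():
    print(f"{dataset}: F1 = {f1_score(precision, recall):.1f}")
```

Rounded to one decimal place, the recomputed values match the reported F1 column (79.2, 62.4, 74.9, 44.9).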
169
+ ## How to use
170
+ ```python
171
+ from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
172
+
173
+ path_to_model = "ai-forever/sage-m2m100-1.2B"
174
+ model = M2M100ForConditionalGeneration.from_pretrained(path_to_model)
175
+ tokenizer = M2M100Tokenizer.from_pretrained(path_to_model, src_lang="ru", tgt_lang="ru")
176
+
177
+ sentence = "прийдя в МГТУ я был удивлен никого необноружив там…"
178
+ encodings = tokenizer(sentence, return_tensors="pt")
179
+ generated_tokens = model.generate(
180
+     **encodings, forced_bos_token_id=tokenizer.get_lang_id("ru"))
181
+ answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
182
+
183
+ print(answer)
184
+ #["прийдя в МГТУ я был удивлен никого не обнаружив там..."]
185
+ ```
186
+
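To correct several sentences at once, the single-sentence snippet above can be wrapped in a small batching helper. This is a minimal sketch, not part of the model card's official usage: `correct_batch`, `chunks`, `batch_size`, and `max_new_tokens=256` are illustrative choices, and `model` and `tokenizer` are assumed to be loaded exactly as in the snippet above.

```python
from typing import Iterable, List


def chunks(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def correct_batch(sentences, model, tokenizer, batch_size=8, max_new_tokens=256):
    """Run the corrector over `sentences` in batches and return corrected strings.

    `model` and `tokenizer` are expected to be the M2M100 objects created
    with `from_pretrained` as in the snippet above (assumption).
    """
    russian_id = tokenizer.get_lang_id("ru")  # force Russian as the output language
    corrected = []
    for batch in chunks(list(sentences), batch_size):
        # Pad to the longest sentence in the batch so the tensors are rectangular.
        encodings = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(
            **encodings,
            forced_bos_token_id=russian_id,
            max_new_tokens=max_new_tokens,
        )
        corrected.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return corrected
```

With the objects from the snippet above, `correct_batch([sentence], model, tokenizer)` returns a list with one corrected string per input, in input order. A smaller `batch_size` lowers peak memory use at the cost of more `generate` calls.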
187
+ ## Resources
188
+ - [SAGE library](https://github.com/ai-forever/sage), GitHub
189
+ - [sage-fredt5-large](https://huggingface.co/ai-forever/sage-fredt5-large), HuggingFace
190
+ - [sage-fredt5-distilled-95m](https://huggingface.co/ai-forever/sage-fredt5-distilled-95m), HuggingFace
191
+ - [sage-m2m100-1.2B](https://huggingface.co/ai-forever/sage-m2m100-1.2B), HuggingFace
192
+ - [sage-mt5-large](https://huggingface.co/ai-forever/sage-mt5-large), HuggingFace
193
+
194
+ ## License
195
+ The [M2M100-1.2B](https://huggingface.co/facebook/m2m100_1.2B) model, on which our solution is based, and its source code are distributed under the MIT license.
196
+ Our solution is also released under the MIT license.
197
+
198
+ ## Specifications
199
+ - File size: 5 GB
200
+ - Framework: PyTorch
201
+ - Format: AI Service
202
+ - Version: v2.0
203
+ - Developer: SberDevices, AGI NLP
204
+
205
+ ## Contacts
206