Commit dbc7808 by DmitryPogrebnoy · 1 parent: 76cb456

Update README.md

Files changed (1): README.md (+59 -1)
@@ -1,3 +1,61 @@
  ---
- license: bsd-3-clause
+ language:
+ - ru
+ license: apache-2.0
  ---
+
+ # Model DmitryPogrebnoy/distilbert-base-russian-cased
+
+ # Model Description
+
+ This model is a Russian version of [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased).
+ The code for the conversion process can be found [here](https://github.com/DmitryPogrebnoy/MedSpellChecker/blob/main/spellchecker/ml_ranging/models/distilbert_base_russian_cased/distilbert_from_multilang_to_ru.ipynb).
+
+ This model produces exactly the same representations as the original model, so it preserves the original accuracy.
+ A similar model exists: [Geotrend/distilbert-base-ru-cased](https://huggingface.co/Geotrend/distilbert-base-ru-cased).
+ However, our model is derived using a slightly different approach.
+ Instead of using the Russian Wikipedia dataset to pick the necessary tokens,
+ we used regular expressions to select only Russian tokens, punctuation marks, numbers, and other service tokens.
+ As a result, our model contains several hundred tokens that were filtered out of [Geotrend/distilbert-base-ru-cased](https://huggingface.co/Geotrend/distilbert-base-ru-cased).
+
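Reviewer note: the regular-expression filter described in the paragraph above could look roughly like the following sketch. The two patterns, the `keep_token` helper, the list of special tokens, and the sample vocabulary are all illustrative assumptions; the linked notebook may use different rules.

```python
import re
import string

# Assumed set of BERT-style service tokens (hypothetical, for illustration).
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

# Optional WordPiece continuation prefix "##", then Cyrillic letters only.
RUSSIAN_TOKEN = re.compile(r"^(##)?[А-Яа-яЁё]+$")

# Optional "##" prefix, then only digits and ASCII punctuation.
DIGITS_OR_PUNCT = re.compile(r"^(##)?[0-9" + re.escape(string.punctuation) + r"]+$")


def keep_token(token: str) -> bool:
    """Return True if the token should survive the vocabulary reduction."""
    return (
        token in SPECIAL_TOKENS
        or RUSSIAN_TOKEN.match(token) is not None
        or DIGITS_OR_PUNCT.match(token) is not None
    )


# Tiny made-up vocabulary standing in for the multilingual one.
sample_vocab = ["[MASK]", "работал", "##росла", "the", "##ing", "2021", ",", "中"]
kept = [token for token in sample_vocab if keep_token(token)]
print(kept)
```

In a real conversion the kept tokens would also need their embedding rows copied from the multilingual model, which this sketch does not cover.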
+ This model was created as part of a master's project to develop a method for correcting typos
+ in medical histories, using BERT models to rank correction candidates.
+ The project is open source and can be found [here](https://github.com/DmitryPogrebnoy/MedSpellChecker).
+
+ # How to Get Started With the Model
+
+ You can use the model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> # Use a distinct name so the imported `pipeline` function is not shadowed.
+ >>> fill_mask = pipeline('fill-mask', model='DmitryPogrebnoy/distilbert-base-russian-cased')
+ >>> fill_mask("Я [MASK] на заводе.")
+ [{'score': 0.11498937010765076,
+ 'token': 1709,
+ 'token_str': 'работал',
+ 'sequence': 'Я работал на заводе.'},
+ {'score': 0.07212855666875839,
+ 'token': 12375,
+ 'token_str': '##росла',
+ 'sequence': 'Яросла на заводе.'},
+ {'score': 0.03575785085558891,
+ 'token': 4059,
+ 'token_str': 'находился',
+ 'sequence': 'Я находился на заводе.'},
+ {'score': 0.02496381290256977,
+ 'token': 5075,
+ 'token_str': 'работает',
+ 'sequence': 'Я работает на заводе.'},
+ {'score': 0.020675526931881905,
+ 'token': 5774,
+ 'token_str': '##дро',
+ 'sequence': 'Ядро на заводе.'}]
+ ```
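Reviewer note: two of the predictions in the output above (`##росла`, `##дро`) are WordPiece continuation pieces that get glued onto the preceding word. If only whole-word predictions are wanted, a small post-processing step might help. The sketch below is an assumption about how a caller could do this, not part of the commit; `whole_word_predictions` is a hypothetical helper, and the data is an abbreviated copy of the output shown above.

```python
def whole_word_predictions(predictions):
    """Drop predictions whose token is a '##' continuation piece."""
    return [p for p in predictions if not p["token_str"].startswith("##")]


# Abbreviated copy of the fill-mask output from the README example.
predictions = [
    {"score": 0.11498937010765076, "token_str": "работал"},
    {"score": 0.07212855666875839, "token_str": "##росла"},
    {"score": 0.03575785085558891, "token_str": "находился"},
    {"score": 0.02496381290256977, "token_str": "работает"},
    {"score": 0.020675526931881905, "token_str": "##дро"},
]

words = [p["token_str"] for p in whole_word_predictions(predictions)]
print(words)
```

The relative order of the remaining predictions is unchanged, so the list is still sorted by score.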
+
+ Alternatively, load the model and tokenizer directly and use them as needed:
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForMaskedLM
+ >>> tokenizer = AutoTokenizer.from_pretrained("DmitryPogrebnoy/distilbert-base-russian-cased")
+ >>> model = AutoModelForMaskedLM.from_pretrained("DmitryPogrebnoy/distilbert-base-russian-cased")
+ ```