rmihaylov committed on
Commit ad0bdad · 1 Parent(s): 64032ac

Update README.md

Files changed (1)
  1. README.md +52 -7
README.md CHANGED
@@ -7,31 +7,62 @@ datasets:
  - oscar
  - chitanka
  - wikipedia
  ---

- This is Bulgarian GPT2 (SMALL) compressed via [progressive module replacing](https://arxiv.org/abs/2002.02925).

  The compression was executed on Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/) and [Wikipedia](https://bg.wikipedia.org/).

  Here is how to use this model in PyTorch:

  ```python
  >>> from transformers import AutoModel, AutoTokenizer
- >>> tokenizer = AutoTokenizer.from_pretrained("rmihaylov/gpt2-small-theseus-bg")
- >>> model = AutoModel.from_pretrained("rmihaylov/gpt2-small-theseus-bg", trust_remote_code=True)

- >>> input_ids = tokenizer.encode("Здравей,", add_special_tokens=False, return_tensors='pt')
  >>> output_ids = model.generate(
  >>> input_ids,
  >>> do_sample=True,
  >>> max_length=50,
  >>> top_p=0.92,
  >>> pad_token_id=2,
- >>> top_k=0
- >>> )

  >>> output = tokenizer.decode(output_ids[0])
- >>> output = output.replace('<|endoftext|>', '\n\n\n').replace('<|unknown|>', '').replace('▁', ' ').strip().replace('<|n|>', '\n')
  >>> print(output)

  Здравей, извинявай, но не мога да заспя.
@@ -39,3 +70,17 @@ Here is how to use this model in PyTorch:
  — Почакай, Джини. Не мога да повярвам, че е възможно! Толкова искам да те видя.
  — Обеща
  ```
  - oscar
  - chitanka
  - wikipedia
+ tags:
+ - torch
  ---

+ # GPT-2
+
+ A model pretrained on Bulgarian text using a causal language modeling (CLM) objective. The GPT-2 architecture was introduced in
+ [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
+ and first released on [this page](https://openai.com/blog/better-language-models/).
+
+ ## Model description
+
+ This is the **SMALL** version compressed via [progressive module replacing](https://arxiv.org/abs/2002.02925).

  The compression was executed on Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/) and [Wikipedia](https://bg.wikipedia.org/).

+ ## Intended uses & limitations
+
+ You can use the raw model for:
+ - text generation
+ - auto-complete
+ - spelling correction
+
+ Or fine-tune it on a downstream task.
+
+ ### How to use
+
  Here is how to use this model in PyTorch:

  ```python
  >>> from transformers import AutoModel, AutoTokenizer
+ >>>
+ >>> model_id = "rmihaylov/gpt2-small-theseus-bg"
+ >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
+ >>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+
+ >>> input_ids = tokenizer.encode(
+ >>> "Здравей,",
+ >>> add_special_tokens=False,
+ >>> return_tensors='pt')

  >>> output_ids = model.generate(
  >>> input_ids,
  >>> do_sample=True,
  >>> max_length=50,
  >>> top_p=0.92,
  >>> pad_token_id=2,
+ >>> top_k=0)

  >>> output = tokenizer.decode(output_ids[0])
+ >>>
+ >>> output = output.replace('<|endoftext|>', '\n\n\n')
+ >>> output = output.replace('<|unknown|>', '')
+ >>> output = output.replace('▁', ' ')
+ >>> output = output.replace('<|n|>', '\n')
+ >>>
  >>> print(output)

  Здравей, извинявай, но не мога да заспя.
@@ -39,3 +70,17 @@ Here is how to use this model in PyTorch:
  — Почакай, Джини. Не мога да повярвам, че е възможно! Толкова искам да те видя.
  — Обеща
  ```
+
+ ### Limitations and bias
+
+ As the OpenAI team themselves point out in their
+ [model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
+
+ > Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
+ > that require the generated text to be true.
+ >
+ > Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
+ > not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a
+ > study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
+ > and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
+ > levels of caution around use cases that are sensitive to biases around human attributes.
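
Review note: the README snippet in this commit cleans up the decoded text with four chained `replace` calls. The same step could be expressed as a small helper, sketched below. This is only an illustration: `clean_generated_text` and `TOKEN_REPLACEMENTS` are hypothetical names that do not exist in the repository or in `transformers`, and the sketch assumes the marker strings and their order are exactly those shown in the README.

```python
# Hypothetical helper, not part of the repository or of transformers.
# The marker strings and their order are copied from the README snippet.
TOKEN_REPLACEMENTS = (
    ("<|endoftext|>", "\n\n\n"),  # end-of-text marker -> blank lines
    ("<|unknown|>", ""),          # unknown-token marker -> dropped
    ("▁", " "),                   # word-boundary marker -> plain space
    ("<|n|>", "\n"),              # newline marker -> real newline
)


def clean_generated_text(text: str) -> str:
    """Apply the README's replacements, in order, to decoded model output."""
    for marker, replacement in TOKEN_REPLACEMENTS:
        text = text.replace(marker, replacement)
    return text


if __name__ == "__main__":
    raw = "▁Здравей,<|n|>▁свят<|unknown|><|endoftext|>"
    print(repr(clean_generated_text(raw)))
```

With such a helper, the last lines of the snippet would reduce to `print(clean_generated_text(tokenizer.decode(output_ids[0])))`. Note that the previous revision of the README also called `.strip()` on the result; the sketch follows the new revision and does not.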