Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

japanese-gpt-neox-3.6b-instruction-sft - bnb 4bits
- Model creator: https://huggingface.co/rinna/
- Original model: https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft/
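
This repository stores the weights of the original model already quantized to 4 bits with bitsandbytes (bnb). The snippet below is a minimal loading sketch, not taken from the original model card: it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed, and `"<this-repo-id>"` is a placeholder for this repository's Hugging Face id. Prompting then follows the original model's I/O format described further down.

~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# "<this-repo-id>" is a placeholder; replace it with this repository's Hub id.
# use_fast=False matches the tokenizer requirement noted in the original model card.
tokenizer = AutoTokenizer.from_pretrained("<this-repo-id>", use_fast=False)

# If the checkpoint carries its bitsandbytes 4-bit quantization config (typical for
# bnb exports), a plain from_pretrained call loads it in 4-bit; device_map="auto"
# (via accelerate) places the weights on the available GPU.
model = AutoModelForCausalLM.from_pretrained(
    "<this-repo-id>",
    device_map="auto",
    torch_dtype=torch.float16,
)
~~~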

Original model description:
---
language: ja
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
tags:
- gpt_neox
- text-generation
- lm
- nlp
license: mit
datasets:
- Anthropic/hh-rlhf
- stanfordnlp/SHP
inference: false
base_model: rinna/japanese-gpt-neox-3.6b
---

# japanese-gpt-neox-3.6b-instruction-sft

![rinna-icon](./rinna.png)

# Overview
This repository provides a Japanese GPT-NeoX model of 3.6 billion parameters. The model is based on [`rinna/japanese-gpt-neox-3.6b`](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) and has been finetuned to serve as an instruction-following conversational agent.

* **Model architecture**

    A 36-layer, 2816-hidden-size transformer-based language model (see the config check sketched after this list).

* **Finetuning**

    The finetuning data is a subset of the following datasets, translated into Japanese.
    * [Anthropic HH RLHF data](https://huggingface.co/datasets/Anthropic/hh-rlhf)
    * [FLAN Instruction Tuning data](https://github.com/google-research/FLAN)
    * [Stanford Human Preferences Dataset](https://huggingface.co/datasets/stanfordnlp/SHP)

    The data will **not** be released.

* **Model Series**

    | Variant | Link |
    | :-- | :-- |
    | 3.6B PPO | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo |
    | 3.6B SFT-v2 | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
    | 3.6B SFT | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft |
    | 3.6B pretrained | https://huggingface.co/rinna/japanese-gpt-neox-3.6b |

* **Contributors**

    [Tianyu Zhao](https://huggingface.co/tianyuz) and [Kei Sawada](https://huggingface.co/keisawada)

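As a quick check (not part of the original model card), the architecture figures listed above can be read off the Hugging Face config without downloading any weights; the field names below assume the standard `GPTNeoXConfig`.

~~~python
from transformers import AutoConfig

# Load only the configuration; no weights are downloaded.
config = AutoConfig.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft")
print(config.num_hidden_layers)  # expected: 36
print(config.hidden_size)        # expected: 2816
~~~
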
# I/O Format
A special format is used to construct model inputs.
* An input prompt is formatted as a conversation between `ユーザー` and `システム`.
* Each input utterance consists of (1) its speaker (`"ユーザー"` or `"システム"`), (2) a colon (`":"`), (3) a whitespace (`" "`), and (4) its utterance text (e.g. `"世界で一番高い山は?"`).
* The input prompt should end with `"システム: "` to signal that the model should generate a response.
* Since the model's tokenizer does not recognize `"\n"`, a special newline symbol `"<NL>"` is used instead.
* All newlines in input and output utterances should be replaced with `"<NL>"`.
* All utterances in the input prompt should be separated by `"<NL>"`.

The following is an example of constructing an input from a conversation.
~~~python
prompt = [
    {
        "speaker": "ユーザー",
        "text": "日本のおすすめの観光地を教えてください。"
    },
    {
        "speaker": "システム",
        "text": "どの地域の観光地が知りたいですか?"
    },
    {
        "speaker": "ユーザー",
        "text": "渋谷の観光地を教えてください。"
    }
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "<NL>".join(prompt)
prompt = (
    prompt
    + "<NL>"
    + "システム: "
)
print(prompt)
# "ユーザー: 日本のおすすめの観光地を教えてください。<NL>システム: どの地域の観光地が知りたいですか?<NL>ユーザー: 渋谷の観光地を教えてください。<NL>システム: "
~~~
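
For convenience, the formatting rules above can be wrapped in a small helper. This function is illustrative only (it is not part of the original model card) and assumes the conversation is a list of dicts with `"speaker"` and `"text"` keys, as in the example above.

~~~python
def build_prompt(conversation):
    """Format a conversation per the rules above: "speaker: text" turns joined by
    "<NL>", newlines replaced with "<NL>", and a trailing "システム: " so the model replies."""
    turns = [
        f"{uttr['speaker']}: {uttr['text']}".replace("\n", "<NL>")
        for uttr in conversation
    ]
    return "<NL>".join(turns) + "<NL>" + "システム: "
~~~

Calling `build_prompt` on the conversation list above reproduces the printed prompt.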

# How to use the model

~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft")

if torch.cuda.is_available():
    model = model.to("cuda")

# `prompt` is the string constructed in the "I/O Format" section above
token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        do_sample=True,
        max_new_tokens=128,
        temperature=0.7,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
output = output.replace("<NL>", "\n")
print(output)
"""分かりました。いくつかのおすすめを紹介します。
1. ハチ公像です。ハチ公像は、日本の観光スポットの1つとして人気があります。
2. スクランブル交差点です。多くの人々が行き交う大きな交差点で、観光客に人気のスポットです。
3. 109です。109は、ショッピングやエンターテイメント施設です。
4. 道玄坂です。道玄坂は、日本の商業地区である坂道です。</s>"""
~~~~

# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer.
* The tokenizer has a vocabulary size of 32,000.
* It uses sentencepiece's byte fallback feature to decompose unknown text pieces into UTF-8 byte pieces and to avoid producing `<UNK>` tokens.
* sentencepiece's `--add_dummy_prefix` option was turned off so that a leading whitespace will not be prepended automatically.
~~~
print(tokenizer.tokenize("吾輩は猫である"))
# ['吾', '輩', 'は', '猫', 'である']
# instead of ['▁', '吾', '輩', 'は', '猫', 'である'] as in rinna/japanese-gpt-1b
~~~
* sentencepiece's `--remove_extra_whitespaces` option was turned off so that leading, trailing, and duplicate whitespaces are preserved.
~~~
print(tokenizer.tokenize("  吾輩は  猫である   "))
# ['▁', '▁', '吾', '輩', 'は', '▁', '▁', '猫', 'である', '▁', '▁', '▁']
# instead of ['▁', '吾', '輩', 'は', '▁猫', 'である'] as in rinna/japanese-gpt-1b
~~~
* Don't forget to set `use_fast=False` to make the above features function correctly.
~~~
good_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
bad_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")

print(good_tokenizer.decode(good_tokenizer.encode("გამარჯობა 吾輩は 猫である ")))
# 'გამარჯობა 吾輩は 猫である </s>'
print(bad_tokenizer.decode(bad_tokenizer.encode("გამარჯობა 吾輩は 猫である ")))
# 'გამარ[UNK]ობა 吾輩は 猫である </s>'
~~~

# How to cite
```bibtex
@misc{rinna-japanese-gpt-neox-3.6b-instruction-sft,
    title = {rinna/japanese-gpt-neox-3.6b-instruction-sft},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```

# License
[The MIT license](https://opensource.org/licenses/MIT)