RichardErkhov committed on
Commit
30fac60
·
verified ·
1 Parent(s): 891820b

uploaded readme

  1. README.md +206 -0
README.md ADDED
@@ -0,0 +1,206 @@
1
+ Quantization made by Richard Erkhov.
2
+
3
+ [Github](https://github.com/RichardErkhov)
4
+
5
+ [Discord](https://discord.gg/pvy7H8DZMG)
6
+
7
+ [Request more models](https://github.com/RichardErkhov/quant_request)
8
+
9
+
10
+ japanese-gpt-neox-3.6b-instruction-sft-v2 - bnb 4bits
11
+ - Model creator: https://huggingface.co/rinna/
12
+ - Original model: https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2/
13
+
14
+
15
+
16
+
17
+ Original model description:
18
+ ---
19
+ language: ja
20
+ thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
21
+ tags:
22
+ - gpt_neox
23
+ - text-generation
24
+ - lm
25
+ - nlp
26
+ license: mit
27
+ datasets:
28
+ - Anthropic/hh-rlhf
29
+ - stanfordnlp/SHP
30
+ inference: false
31
+ base_model: rinna/japanese-gpt-neox-3.6b
32
+ ---
33
+
34
+ # japanese-gpt-neox-3.6b-instruction-sft-v2
35
+
36
+ ![rinna-icon](./rinna.png)
37
+
38
+ # Overview
39
+ This repository provides a Japanese GPT-NeoX model of 3.6 billion parameters. The model is based on [`rinna/japanese-gpt-neox-3.6b`](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) and has been finetuned to serve as an instruction-following conversational agent.
40
+
41
+ This model slightly differs from the previous SFT model [`rinna/japanese-gpt-neox-3.6b-instruction-sft`](https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft), where a different data split is used for training.
42
+
43
+ * **Model architecture**
44
+
45
+ A 36-layer, 2816-hidden-size transformer-based language model.
46
+
47
+ * **SFT vs. previous SFT evaluation**
48
+
49
+ We conducted ChatGPT-based automated evaluation on 100 prompts to assess the performance difference between this SFT model and the previous SFT model.
50
+
51
+ | [this SFT](https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2) vs. [previous SFT](https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft) | win | tie | loss |
52
+ | :---: | :---: | :---: | :---: |
53
+ | ChatGPT auto. evaluation | **55**% | 0% | 45% |
54
+
55
+ * **Finetuning**
56
+
57
+ The finetuning data is the subset of the following datasets and has been translated into Japanese.
58
+ * [Anthropic HH RLHF data](https://huggingface.co/datasets/Anthropic/hh-rlhf)
59
+ * [FLAN Instruction Tuning data](https://github.com/google-research/FLAN)
60
+ * [Stanford Human Preferences Dataset](https://huggingface.co/datasets/stanfordnlp/SHP)
61
+
62
+ The data will **not** be released.
63
+
64
+ * **Model Series**
65
+
66
+ | Variant | Link |
67
+ | :-- | :--|
68
+ | 3.6B PPO | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo |
69
+ | 3.6B SFT-v2 | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
70
+ | 3.6B SFT | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft |
71
+ | 3.6B pretrained | https://huggingface.co/rinna/japanese-gpt-neox-3.6b |
72
+
73
+ * **Contributors**
74
+
75
+ [Tianyu Zhao](https://huggingface.co/tianyuz) and [Kei Sawada](https://huggingface.co/keisawada)
76
+
77
+ # I/O Format
78
+
79
+ A special format has been adopted to construct inputs.
80
+ * An input prompt is formatted as a conversation between `ユーザー` and `システム`.
81
+ * Each input utterance consists of (1) its speaker (`"ユーザー"` or `"システム"`), (2) a colon (`":"`), (3) a whitespace (`" "`), and (4) utterance text (e.g. `"世界で一番高い山は?"`).
82
+ * The input prompt should be ended with `"システム: "` to acknowledge the model to generate a response.
83
+ * Since the model's tokenizer does not recognize `"\n"`, a special newline symbol `"<NL>"` is used instead.
84
+ * All the newlines in input and output utterances should be replaced with `"<NL>"`.
85
+ * All the utterances in the input prompt should be separated by `"<NL>"`.
86
+
87
+ Following is an example to construct an input from a conversation.
88
+ ~~~python
89
+ prompt = [
90
+ {
91
+ "speaker": "ユーザー",
92
+ "text": "コンタクトレンズを慣れるにはどうすればよいですか?"
93
+ },
94
+ {
95
+ "speaker": "システム",
96
+ "text": "これについて具体的に説明していただけますか?何が難しいのでしょうか?"
97
+ },
98
+ {
99
+ "speaker": "ユーザー",
100
+ "text": "目が痛いのです。"
101
+ },
102
+ {
103
+ "speaker": "システム",
104
+ "text": "分かりました、コンタクトレンズをつけると目がかゆくなるということですね。思った以上にレンズを外す必要があるでしょうか?"
105
+ },
106
+ {
107
+ "speaker": "ユーザー",
108
+ "text": "いえ、レンズは外しませんが、目が赤くなるんです。"
109
+ }
110
+ ]
111
+ prompt = [
112
+ f"{uttr['speaker']}: {uttr['text']}"
113
+ for uttr in prompt
114
+ ]
115
+ prompt = "<NL>".join(prompt)
116
+ prompt = (
117
+ prompt
118
+ + "<NL>"
119
+ + "システム: "
120
+ )
121
+ print(prompt)
122
+ # "ユーザー: コンタクトレンズを慣れるにはどうすればよいですか?<NL>システム: これについて具体的に説明していただけますか?何が難しいのでしょうか?<NL>ユーザー: 目が痛いのです。<NL>システム: 分かりました、コンタクトレンズをつけると目がかゆくなるということ��すね。思った以上にレンズを外す必要があるでしょうか?<NL>ユーザー: いえ、レンズは外しませんが、目が赤くなるんです。<NL>システム: "
123
+ ~~~
124
+
125
+ # How to use the model
126
+
127
+ ~~~~python
128
+ import torch
129
+ from transformers import AutoTokenizer, AutoModelForCausalLM
130
+
131
+ tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft-v2", use_fast=False)
132
+ model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft-v2")
133
+
134
+ if torch.cuda.is_available():
135
+ model = model.to("cuda")
136
+
137
+ token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
138
+
139
+ with torch.no_grad():
140
+ output_ids = model.generate(
141
+ token_ids.to(model.device),
142
+ do_sample=True,
143
+ max_new_tokens=128,
144
+ temperature=0.7,
145
+ repetition_penalty=1.1,
146
+ pad_token_id=tokenizer.pad_token_id,
147
+ bos_token_id=tokenizer.bos_token_id,
148
+ eos_token_id=tokenizer.eos_token_id
149
+ )
150
+
151
+ output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
152
+ output = output.replace("<NL>", "\n")
153
+ print(output)
154
+ """わかりました。まずは、コンタクトレンズを長時間着用することによる目の乾燥を防ぐことができます。また、毎日同じ時間帯にコンタクトレンズを着用してみることもできます。そして、コンタクトレンズが目に合わないような場合は、新しいものを試してみる必要があります。</s>"""
155
+ ~~~~
156
+
157
+ # Tokenization
158
+ The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer.
159
+ * The tokenizer has a vocabulary size of 32,000.
160
+ * It uses sentencepiece's byte fallback feature to decompose unknown text pieces into UTF-8 byte pieces and to avoid producing `<UNK>` tokens.
161
+ * sentencepiece's `--add_dummy_prefix` option was turned off so that a leading whitespace will not be prepended automatically.
162
+ ~~~
163
+ print(tokenizer.tokenize("吾輩は猫である"))
164
+ # ['吾', '輩', 'は', '猫', 'である']
165
+ # instead of ['▁', '吾', '輩', 'は', '猫', 'である'] as in rinna/japanese-gpt-1b
166
+ ~~~
167
+ * sentencepiece's `--remove_extra_whitespaces` option was turned off so that leading, trailing, and duplicate whitespaces are reserved.
168
+ ~~~
169
+ print(tokenizer.tokenize(" 吾輩は 猫である "))
170
+ # ['▁', '▁', '吾', '輩', 'は', '▁', '▁', '猫', 'である', '▁', '▁', '▁']
171
+ # instead of ['▁', '吾', '輩', 'は', '▁猫', 'である'] as in rinna/japanese-gpt-1b
172
+ ~~~
173
+ * Don't forget to set `use_fast=False` to make the above features function correctly.
174
+ ~~~
175
+ good_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
176
+ bad_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")
177
+
178
+ print(good_tokenizer.decode(good_tokenizer.encode("გამარჯობა 吾輩は 猫である ")))
179
+ # 'გამარჯობა 吾輩は 猫である </s>'
180
+ print(bad_tokenizer.decode(bad_tokenizer.encode("გამარჯობა 吾輩は 猫である ")))
181
+ # 'გამარ[UNK]ობა 吾輩は 猫である </s>'
182
+ ~~~
183
+
184
+ # How to cite
185
+ ```bibtex
186
+ @misc{rinna-japanese-gpt-neox-3.6b-instruction-sft-v2,
187
+ title = {rinna/japanese-gpt-neox-3.6b-instruction-sft-v2},
188
+ author = {Zhao, Tianyu and Sawada, Kei},
189
+ url = {https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2}
190
+ }
191
+
192
+ @inproceedings{sawada2024release,
193
+ title = {Release of Pre-Trained Models for the {J}apanese Language},
194
+ author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
195
+ booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
196
+ month = {5},
197
+ year = {2024},
198
+ pages = {13898--13905},
199
+ url = {https://aclanthology.org/2024.lrec-main.1213},
200
+ note = {\url{https://arxiv.org/abs/2404.01657}}
201
+ }
202
+ ```
203
+
204
+ # Licenese
205
+ [The MIT license](https://opensource.org/licenses/MIT)
206
+