noriyukipy committed
Commit · e1850c2
1 Parent(s): c4da11c

Upload new model with the latest data on Aug20, 2021

- CHANGELOG.md +0 -16
- README.md +27 -22
- config.json +4 -2
- flax_model.msgpack +0 -3
- pytorch_model.bin +2 -2
- spiece.model +2 -2
- tf_model.h5 +2 -2
- tokenizer_config.json +1 -1

CHANGELOG.md
DELETED
@@ -1,16 +0,0 @@
-# Changelog
-
-## [v1]
-
-### 2021-04-01
-#### Added
-- disclaimer and author in the "License" section
-- CHANGELOG.md
-
-### 2021-03-27
-#### Modified
-- config.json to set default generation parameters of top_k=50, top_p=0.95 and d_samples=True
-
-### 2021-03-27
-#### Added
-- models and model card

README.md
CHANGED
@@ -2,18 +2,20 @@
 language: ja
 datasets: wikipedia
 widget:
-- text:
-license: cc-by-sa-
+- text: 統計的機械学習でのニューラルネットワーク
+license: cc-by-sa-3.0
 ---
 
 # GPT-2 small Japanese model
 
-This repository contains a
+This repository contains a GPT2-small model trained on Japanese Wikipedia dataset.
 
 ## Training data
 
-[Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset
-
+[Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset as of Aug20, 2021 released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) is used for both tokenizer and GPT-2 model.
+
+We splitted the dataset into three subsets - train, valid and test sets. Both tokenizer and model were trained on the train set.
+Train set contains around 540M tokens.
 
 ## Model description
 
@@ -23,42 +25,45 @@ The vocabulary size is set to 32,000 instead of an original size of 50,257.
 
 ## Tokenizer description
 
-[SentencePiece](https://github.com/google/sentencepiece)
+[SentencePiece](https://github.com/google/sentencepiece) is used as a tokenizer for this model.
 
-
-The vocabulary size
+We utilized 1,000,000 sentences from train set.
+The vocabulary size was 32,000.
+A `add_dummy_prefix` option was set to `True` because Japanese words are not separated by whitespaces.
 
-After training, the model
+After training, the tokenizer model was imported as `transformers.BERTGenerationTokenizer`
+because it supports SentencePiece models and it does not add any special tokens as default,
+which is useful expecially for a text generation task.
 
 ## Training
 
-The model
-
-
+The model was trained on the train set for 30 epochs with batch size 32. Each sample contained 1024 tokens.
+
+We utilized Adam optimizer. Learning rate was linearly increased from `0` to `1e-4` during the first 10,000 steps.
+A clip norm was set to `1.0`.
+
+Test set perplexity of the trained model was 29.13.
 
-
+Please refer to [GitHub](https://github.com/colorfulscoop/gpt-ja) for more training details.
 
 ## Usage
 
 First, install dependecies.
 
 ```sh
-$ pip install transformers==4.
+$ pip install transformers==4.10.0 torch==1.8.1 sentencepiece==0.1.96
 ```
 
-Then
+Then use pipeline to generate sentences.
 
 ```sh
 >>> import transformers
->>>
->>>
->>> input = tokenizer.encode("近年の機械学習は", return_tensors="pt")
->>> output = model.generate(input, do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)
->>> tokenizer.batch_decode(output)
-['近年の機械学習は、特に、コンピューター学習において重要な概念である。この概念は、教育心理学', '近年の機械学習は時間間隔の短縮、時間間隔の短縮、学習時間の短縮、学習の', '近年の機械学習は、学生と学生が自分の能力を高め、結果を向上させることを目的としている。それは、']
+>>> pipeline = transformers.pipeline("text-generation", "colorfulscoop/gpt2-small-ja")
+>>> pipeline("統計的機械学習でのニューラルネットワーク", do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)
 ```
 
-**Note:** The default model configuration `config.json` sets
+**Note:** The default model configuration `config.json` sets parameters for text generation with `do_sample=True`, `top_k=50`, `top_p=0.95`.
+Please set these parameters when you need to use different parameters.
 
 ## License
 

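The updated README states that the learning rate was increased linearly from `0` to `1e-4` during the first 10,000 steps. A minimal Python sketch of such a warmup schedule is given below. The function name `warmup_lr` is hypothetical, and holding the rate constant after warmup is an assumption, since the README does not say what happens after step 10,000.

```python
def warmup_lr(step: int, peak_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """Return the learning rate at a given optimizer step (hypothetical helper).

    Linear warmup from 0 to peak_lr over warmup_steps, as the README describes.
    Assumption: the rate stays at peak_lr once warmup has finished.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


if __name__ == "__main__":
    # Print the schedule at a few representative steps.
    for step in (0, 2_500, 5_000, 10_000, 20_000):
        print(step, warmup_lr(step))
```

The repository linked from the README is the place to check the schedule actually used; this sketch only mirrors the one sentence quoted above.
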
config.json
CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "
+  "_name_or_path": "models/gpt2-small",
   "activation_function": "gelu_new",
   "architectures": [
     "GPT2LMHeadModel"
@@ -21,6 +21,7 @@
   "n_positions": 1024,
   "pad_token_id": 0,
   "resid_pdrop": 0.1,
+  "scale_attn_weights": true,
   "sep_token_id": 5,
   "summary_activation": null,
   "summary_first_dropout": 0.1,
@@ -28,7 +29,8 @@
   "summary_type": "cls_index",
   "summary_use_proj": true,
   "tokenizer_class": "BertGenerationTokenizer",
-  "
+  "torch_dtype": "float32",
+  "transformers_version": "4.10.0",
   "unk_token_id": 1,
   "use_cache": true,
   "vocab_size": 32000,

flax_model.msgpack
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:f16a0793a705aed9f18b790ed1664e3d536e293d0bc93e1ce9e495000249684e
-size 441678616

pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1e95f3fb022ae9e599aaacf7eb9b69cc2194b817da1d07dca8de6ec6127d33d4
+size 454320757

spiece.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:ec01688f1fdf79d9596d099bfe2cd1ec8d0871e848660ca643907a3e4a5fb97f
+size 802969

tf_model.h5
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:20683b6b0ee0f9c2e6b017085fa169ae4ea62af56b46ff17334e312e6b8b1849
+size 441849416

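The model binaries in this commit are stored through Git LFS, so each of the diffs above shows only a three-line pointer file (`version`, `oid`, `size`) rather than the binary itself. As a rough illustration, the Python sketch below parses such a pointer. `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling, and it handles only the three keys seen in this commit.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a dict (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer for pytorch_model.bin as added in this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1e95f3fb022ae9e599aaacf7eb9b69cc2194b817da1d07dca8de6ec6127d33d4
size 454320757
"""

if __name__ == "__main__":
    info = parse_lfs_pointer(POINTER)
    print(info["hash_algo"], info["size_bytes"])
```

The `size` field is the byte size of the real file, which is why `pytorch_model.bin` shows a `+2 -2` change even though the whole binary was replaced.
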
tokenizer_config.json
CHANGED
@@ -1 +1 @@
-{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "cls_token": "<cls>"}
+{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "sp_model_kwargs": {}, "cls_token": "<cls>", "special_tokens_map_file": "models/small-v2/special_tokens_map.json", "tokenizer_file": null, "name_or_path": "models/small-v2/", "tokenizer_class": "BertGenerationTokenizer"}

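Because `tokenizer_config.json` is a single-line JSON file, the textual diff above replaces the whole line and hides which keys changed. The Python sketch below compares the old and new versions key by key; `added_keys` and `changed_keys` are hypothetical helper names introduced only for this illustration.

```python
import json

# Old and new file contents, copied from the diff above.
OLD = '{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "cls_token": "<cls>"}'
NEW = '{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "sp_model_kwargs": {}, "cls_token": "<cls>", "special_tokens_map_file": "models/small-v2/special_tokens_map.json", "tokenizer_file": null, "name_or_path": "models/small-v2/", "tokenizer_class": "BertGenerationTokenizer"}'


def added_keys(old_json: str, new_json: str) -> list:
    """Return the sorted keys present only in the new JSON object."""
    old, new = json.loads(old_json), json.loads(new_json)
    return sorted(set(new) - set(old))


def changed_keys(old_json: str, new_json: str) -> list:
    """Return the sorted keys present in both objects whose values differ."""
    old, new = json.loads(old_json), json.loads(new_json)
    return sorted(k for k in set(old) & set(new) if old[k] != new[k])


if __name__ == "__main__":
    print("added:", added_keys(OLD, NEW))
    print("changed:", changed_keys(OLD, NEW))
```

Run on the two versions above, this shows that the commit only adds keys and leaves the six special-token entries untouched.
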