Add model card
- README.md +70 -0
- training.md +146 -0
README.md
ADDED
@@ -0,0 +1,70 @@
---
language:
-
-
thumbnail:
tags:
-
-
-
license:
datasets:
-
-
metrics:
-
-
---

# GPT-2 GERMAN

## Model description

TODO

## Intended uses & limitations

#### How to use

```python
# You can include sample code which will be formatted
```
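
A minimal usage sketch with 🤗 Transformers is shown below. The repository id `<org>/gpt2-german` is a placeholder (the final model id is not fixed yet), and the example assumes PyTorch weights are available.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Placeholder repository id -- replace with the published model id.
model_id = "<org>/gpt2-german"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short German continuation.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Der Sinn des Lebens ist", max_length=50, do_sample=True, top_k=50))
```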

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

https://huggingface.co/datasets/german-nlp-group/german_common_crawl

An example record from the dataset:

```python
{'url': 'http://my-shop.ru/shop/books/545473.html',
 'date_download': '2016-10-20T19:38:58Z',
 'digest': 'sha1:F62EMGYLZDIKF4UL5JZYU47KWGGUBT7T',
 'length': 1155,
 'nlines': 4,
 'source_domain': 'my-shop.ru',
 'title': 'Grammatikalische Liebeslieder. Methodische Vorschläge',
 'raw_content': 'Grammatikalische Liebeslieder. [....]',
 'cc_segment': 'crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/wet/CC-MAIN-20161020183837-00354-ip-10-171-6-4.ec2.internal.warc.wet.gz',
 'original_nlines': 99,
 'original_length': 2672,
 'language': 'de',
 'language_score': 1.0,
 'perplexity': 283.0,
 'bucket': 'head'}
```
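
For reference, the dataset can be loaded with 🤗 Datasets roughly as follows. This is a sketch: the exact config and split names are assumptions, so check the dataset card; streaming is used because the corpus is large.

```python
from datasets import load_dataset

# Config/split names are assumptions -- see the dataset card for the exact ones.
dataset = load_dataset(
    "german-nlp-group/german_common_crawl",
    split="train",
    streaming=True,  # avoid downloading the full corpus
)

for record in dataset:
    print(record["url"], record["language_score"])
    break
```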

## Training procedure

TODO (See training.md)

## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2021}
}
```

training.md
ADDED
@@ -0,0 +1,146 @@
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Language model training examples

The following example showcases how to train a language model from scratch
using the JAX/Flax backend.

JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU.
Models written in JAX/Flax are **immutable** and updated in a purely functional
way which enables simple and efficient model parallelism.
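
To make the "pure functions, immutable parameters" point concrete, here is a tiny illustrative sketch (not part of the example scripts): the loss is a pure function of the parameters, its gradient is jit-compiled, and an update returns a new parameter pytree instead of mutating the old one.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Pure function: the output depends only on the inputs, nothing is mutated.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# Trace once, compile to fused XLA code for GPU/TPU.
grad_fn = jax.jit(jax.grad(loss_fn))

params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
x, y = jnp.ones((8, 4)), jnp.ones((8, 1))

grads = grad_fn(params, x, y)
# Functional update: build a new pytree rather than modifying params in place.
params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```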

## Causal language modeling

In the following, we demonstrate how to train an auto-regressive causal transformer model
in JAX/Flax.
More specifically, we pretrain a randomly initialized 124M-parameter [**`gpt2`**](https://huggingface.co/gpt2) model in Norwegian on a single TPUv3-8.

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-gpt2"`, but you can change the model name as you like.

You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
you are logged in) or via the command line:

```
huggingface-cli repo create norwegian-gpt2
```

Next we clone the model repository to add the tokenizer and model files.

```
git clone https://huggingface.co/<your-username>/norwegian-gpt2
```

To ensure that all tensorboard traces will be uploaded correctly, we need to
track them. You can run the following command inside your model repo to do so.

```
cd norwegian-gpt2
git lfs track "*tfevents*"
```

Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

Next, let's add a symbolic link to `run_clm_flax.py`.

```bash
export MODEL_DIR="./norwegian-gpt2"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
```

### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model. Similar to how it is shown in [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train), we use a **`ByteLevelBPETokenizer`**.
The tokenizer is trained on the complete Norwegian dataset of OSCAR
and subsequently saved in `${MODEL_DIR}`.
This can take up to 10 minutes depending on your hardware ☕.

```python
from datasets import load_dataset
from tokenizers import trainers, Tokenizer, normalizers, ByteLevelBPETokenizer

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# load dataset
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Instantiate tokenizer
tokenizer = ByteLevelBPETokenizer()

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]["text"]

# Customized training
tokenizer.train_from_iterator(batch_iterator(), vocab_size=50265, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save(f"{model_dir}/tokenizer.json")
```
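
As an optional sanity check (not part of the original walkthrough), the saved `tokenizer.json` can be loaded back as a fast tokenizer and tried on a short Norwegian sentence:

```python
from transformers import PreTrainedTokenizerFast

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Wrap the raw tokenizers file in a transformers-compatible fast tokenizer.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_dir}/tokenizer.json")
print(tokenizer.tokenize("Dette er en test."))
```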

### Create configuration

Next, we create the model's configuration file. This is as simple
as loading the configuration of [**`gpt2`**](https://huggingface.co/gpt2)
and storing it in the local model folder:

```python
from transformers import GPT2Config

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)
```
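
For illustration only (the training script below takes care of this step), a randomly initialized Flax GPT-2 model could be built from the saved configuration roughly like this:

```python
import jax.numpy as jnp
from transformers import FlaxGPT2LMHeadModel, GPT2Config

# Re-load the saved configuration and instantiate a randomly initialized model.
config = GPT2Config.from_pretrained("./norwegian-gpt2")
model = FlaxGPT2LMHeadModel(config, seed=0, dtype=jnp.float32)
print(model.config.n_layer, model.config.n_embd)  # 12 layers, 768 hidden units for gpt2
```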

### Train model

Next we can run the example script to pretrain the model:

```bash
./run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --do_train --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="20" \
    --push_to_hub
```

Training should converge at a loss and perplexity
of 3.24 and 25.72 (perplexity being the exponential of the evaluation loss) respectively after 20 epochs on a single TPUv3-8.
This should take less than 21 hours.
Training statistics can be accessed on [tensorboard.dev](https://tensorboard.dev/experiment/2zEhLwJ0Qp2FAkI3WVH9qA).