uer committed on
Commit
1bcd7f4
·
1 Parent(s): c4d3724

Update README.md

  1. README.md +34 -28
README.md CHANGED
@@ -35,64 +35,70 @@ You can use the model directly with a pipeline for text generation:

  ## Training procedure

- The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 1024.

  Stage1:

  ```
- python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
- --vocab_path models/google_zh_vocab.txt \
- --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
  --seq_length 128 --processes_num 32 --target lm
  ```

  ```
- python3 pretrain.py --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
- --vocab_path models/google_zh_vocab.txt \
- --config_path models/gpt2/config.json \
- --output_model_path models/cluecorpussmall_gpt2_seq128_model.bin \
- --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
- --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
- --learning_rate 1e-4 --batch_size 64 \
- --embedding word_pos --remove_embedding_layernorm \
- --encoder transformer --mask causal --layernorm_positioning pre \
  --target lm --tie_weight
  ```

  Stage2:

  ```
- python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
- --vocab_path models/google_zh_vocab.txt \
- --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
  --seq_length 1024 --processes_num 32 --target lm
  ```

  ```
- python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
- --pretrained_model_path models/cluecorpussmall_gpt2_seq128_model.bin-1000000 \
- --vocab_path models/google_zh_vocab.txt \
- --config_path models/gpt2/config.json \
- --output_model_path models/cluecorpussmall_gpt2_seq1024_model.bin \
- --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
- --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
- --learning_rate 5e-5 --batch_size 16 \
- --embedding word_pos --remove_embedding_layernorm \
- --encoder transformer --mask causal --layernorm_positioning pre \
  --target lm --tie_weight
  ```

  Finally, we convert the pre-trained model into Huggingface's format:

  ```
- python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluecorpussmall_gpt2_seq1024_model.bin-250000 \
- --output_model_path pytorch_model.bin \
  --layers_num 12
  ```

  ### BibTeX entry and citation info

  ```
  @article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},


  ## Training procedure

+ The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 1024.

  Stage1:

  ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \\
+ --vocab_path models/google_zh_vocab.txt \\
+ --dataset_path cluecorpussmall_lm_seq128_dataset.pt \\
  --seq_length 128 --processes_num 32 --target lm
  ```

  ```
+ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq128_dataset.pt \\
+ --vocab_path models/google_zh_vocab.txt \\
+ --config_path models/gpt2/config.json \\
+ --output_model_path models/cluecorpussmall_gpt2_seq128_model.bin \\
+ --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\
+ --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \\
+ --learning_rate 1e-4 --batch_size 64 \\
+ --embedding word_pos --remove_embedding_layernorm \\
+ --encoder transformer --mask causal --layernorm_positioning pre \\
  --target lm --tie_weight
  ```

  Stage2:

  ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \\
+ --vocab_path models/google_zh_vocab.txt \\
+ --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \\
  --seq_length 1024 --processes_num 32 --target lm
  ```

  ```
+ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \\
+ --pretrained_model_path models/cluecorpussmall_gpt2_seq128_model.bin-1000000 \\
+ --vocab_path models/google_zh_vocab.txt \\
+ --config_path models/gpt2/config.json \\
+ --output_model_path models/cluecorpussmall_gpt2_seq1024_model.bin \\
+ --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\
+ --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \\
+ --learning_rate 5e-5 --batch_size 16 \\
+ --embedding word_pos --remove_embedding_layernorm \\
+ --encoder transformer --mask causal --layernorm_positioning pre \\
  --target lm --tie_weight
  ```

  Finally, we convert the pre-trained model into Huggingface's format:

  ```
+ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluecorpussmall_gpt2_seq1024_model.bin-250000 \\
+ --output_model_path pytorch_model.bin \\
  --layers_num 12
  ```

  ### BibTeX entry and citation info

  ```
+ @article{radford2019language,
+ title={Language Models are Unsupervised Multitask Learners},
+ author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
+ year={2019}
+ }
+
  @article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},