ivanc commited on
Commit
feeaeac
·
1 Parent(s): 0485028

Update README.md

  1. README.md +138 -7
README.md CHANGED
@@ -1,35 +1,166 @@
1
  ---
2
  tags:
3
  - generated_from_trainer
 
 
 
4
  model-index:
5
  - name: groupbert-base-uncased
6
  results: []
 
 
 
7
  ---
8
 
9
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
10
  should probably proofread and complete it, then remove this comment. -->
11
 
12
- # groupbert-base-uncased
13
 
14
- This model was trained from scratch on the None dataset.
 
 
15
 
16
  ## Model description
17
 
18
- More information needed
 
 
 
 
 
19
 
20
  ## Intended uses & limitations
 
21
 
22
- More information needed
 
23
 
24
  ## Training and evaluation data
25
 
26
- More information needed
27
 
28
  ## Training procedure
29
 
30
  ### Training hyperparameters
31
 
32
- The following hyperparameters were used during training:
33
  - learning_rate: 0.01
34
  - train_batch_size: 2
35
  - eval_batch_size: 1
@@ -53,4 +184,4 @@ The following hyperparameters were used during training:
53
  - Transformers 4.20.1
54
  - Pytorch 1.10.0+cpu
55
  - Datasets 2.6.1
56
- - Tokenizers 0.12.1
 
1
  ---
2
  tags:
3
  - generated_from_trainer
4
+ datasets:
5
+ - Graphcore/wikipedia-bert-128
6
+ - Graphcore/wikipedia-bert-512
7
  model-index:
8
  - name: groupbert-base-uncased
9
  results: []
10
+ license: apache-2.0
11
+ language:
12
+ - en
13
  ---
14
 
15
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
16
  should probably proofread and complete it, then remove this comment. -->
17
 
18
+ # Graphcore/groupbert-base-uncased
19
 
20
+ Optimum Graphcore is an open-source library and toolkit that gives developers access to IPU-optimized models certified by Hugging Face. It extends Transformers with a set of performance optimization tools for training and running models efficiently on Graphcore's IPUs, a new kind of massively parallel processor designed to accelerate machine intelligence. Learn more about how to train Transformer models faster with IPUs at [hf.co/hardware/graphcore](https://huggingface.co/hardware/graphcore).
21
+
22
+ Through Hugging Face Optimum, Graphcore has released ready-to-use IPU-trained model checkpoints and IPU configuration files that make it easy to train models efficiently on the IPU. Optimum shortens the development lifecycle of your AI models by letting you plug in any public dataset and integrate seamlessly with our state-of-the-art hardware, giving you a quicker time-to-value for your AI project.
23
 
24
  ## Model description
25
 
26
+ GroupBERT is a variant of BERT (Bidirectional Encoder Representations from Transformers), a Transformer model designed by Graphcore to pretrain bidirectional representations from unlabelled text. GroupBERT uses grouped convolutions and matmuls in the encoder, which allows computation to be parallelized and achieves higher parameter efficiency. More details are described in the [GroupBERT paper](https://arxiv.org/pdf/2106.05822.pdf).
27
+
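To make the parameter-efficiency point concrete, here is a minimal pure-Python sketch of a grouped matmul, treated as a block-diagonal linear map. It is an illustration only, not the GroupBERT implementation: the function names, the group count of 4 and the 768 x 3072 layer size are assumptions chosen for the example and are not taken from this model card.

```python
def grouped_matmul(x, weights):
    """Apply a grouped (block-diagonal) linear map to the vector x.

    `weights` holds one small matrix per group, each stored as nested lists
    with (len(x) / groups) rows. Group g only reads its own slice of x, so
    the groups are independent and can be computed in parallel.
    """
    groups = len(weights)
    if len(x) % groups != 0:
        raise ValueError("input size must be divisible by the number of groups")
    chunk = len(x) // groups
    out = []
    for g, w in enumerate(weights):
        if len(w) != chunk:
            raise ValueError("each group matrix needs len(x) / groups rows")
        x_g = x[g * chunk:(g + 1) * chunk]
        n_out = len(w[0])
        for j in range(n_out):
            out.append(sum(x_g[i] * w[i][j] for i in range(chunk)))
    return out


def param_count(d_in, d_out, groups=1):
    """Weights in a linear layer split into `groups` equal blocks (no bias)."""
    return (d_in // groups) * (d_out // groups) * groups


# Illustrative sizes only: a dense 768 x 3072 projection versus 4 groups.
dense = param_count(768, 3072)
grouped = param_count(768, 3072, groups=4)
print(dense, grouped, dense // grouped)
```

With G groups, the weight count drops by a factor of G because each group only connects its own slice of the input to its own slice of the output.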
28
+ It was trained with two pretraining objectives: masked language modelling (MLM) and next sentence prediction (NSP). Unlike a traditional language model, which sees the words one after another, MLM lets the model learn a bidirectional representation. In addition to MLM, NSP is used to jointly pre-train text-pair representations. As with BERT, this enables easy and fast fine-tuning for downstream tasks such as sequence classification, named entity recognition, question answering, multiple choice and masked language modelling.
29
+
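As a rough illustration of the MLM objective, the sketch below corrupts a token sequence the way the original BERT recipe does: about 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token and 10% are left unchanged. It assumes GroupBERT follows the same recipe; the `mask_tokens` helper and the toy vocabulary are hypothetical and are not part of `optimum-graphcore` (the Graphcore Wikipedia datasets are already preprocessed).

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels) for one MLM training example.

    labels[i] is the original token at positions the model must predict,
    and None everywhere else.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Toy data, for illustration only.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
print(inputs)
print(labels)
```

Because the model cannot tell which visible tokens were left untouched, it has to use context on both sides of every position, which is what makes the learned representation bidirectional.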
30
+ Pre-trained representations reduce the engineering effort needed to build task-specific architectures, and the model achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks.
31
+
32
 
33
  ## Intended uses & limitations
34
+ This model is a pre-trained GroupBERT-Base trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.
35
 
36
+ It was trained on a Graphcore IPU-POD16 using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
37
+ Graphcore and Hugging Face are working together to make training of Transformer models on IPUs fast and easy. Learn more about how to take advantage of the power of Graphcore IPUs to train Transformers models at [hf.co/hardware/graphcore](https://huggingface.co/hardware/graphcore).
38
 
39
  ## Training and evaluation data
40
 
41
+ Trained on Wikipedia datasets:
42
+ - [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
43
+ - [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)
44
+
45
+ ## Fine-tuning with these weights
46
+
47
+ These weights can be used in either `transformers` or [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
48
+
49
+ For example, to fine-tune on SQuAD v1 with `optimum-graphcore` you can run:
50
+
51
+ ```
52
+ python examples/question-answering/run_qa.py \
53
+ --model_name_or_path Graphcore/groupbert-base-uncased \
54
+ --ipu_config_name Graphcore/groupbert-base-uncased \
55
+ --dataset_name squad \
56
+ --version_2_with_negative False \
57
+ --do_train \
58
+ --do_eval \
59
+ --pad_on_batch_axis \
60
+ --num_train_epochs 1 \
61
+ --per_device_train_batch_size 1 \
62
+ --per_device_eval_batch_size 16 \
63
+ --gradient_accumulation_steps 10 \
64
+ --pod_type pod16 \
65
+ --learning_rate 4e-4 \
66
+ --max_seq_length 384 \
67
+ --doc_stride 128 \
68
+ --seed 42 \
69
+ --lr_scheduler_type linear \
70
+ --lamb \
71
+ --loss_scaling 64 \
72
+ --weight_decay 0.01 \
73
+ --warmup_ratio 0.1 \
74
+ --logging_steps 5 \
75
+ --save_steps -1 \
76
+ --dataloader_num_workers 64 \
77
+ --output_dir output/squad_groupbert_base
78
+ ```
79
 
80
  ## Training procedure
81
 
82
+ Trained with the MLM and NSP pre-training scheme from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).
83
+ Trained on a Graphcore IPU-POD16 using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
84
+
85
+ It was trained with the IPUConfig [Graphcore/bert-base-ipu](https://huggingface.co/Graphcore/bert-base-ipu/).
86
+
87
+ Command lines:
88
+
89
+ Phase 1:
90
+ ```
91
+ python examples/language-modeling/run_pretraining.py \
92
+ --model_type groupbert \
93
+ --tokenizer_name bert-base-uncased \
94
+ --ipu_config_name Graphcore/bert-base-ipu \
95
+ --dataset_name Graphcore/wikipedia-bert-128 \
96
+ --do_train \
97
+ --logging_steps 5 \
98
+ --max_seq_length 128 \
99
+ --max_steps 10500 \
100
+ --is_already_preprocessed \
101
+ --dataloader_num_workers 64 \
102
+ --dataloader_mode async_rebatched \
103
+ --lamb \
104
+ --per_device_train_batch_size 8 \
105
+ --gradient_accumulation_steps 2000 \
106
+ --pod_type pod16 \
107
+ --learning_rate 0.012 \
108
+ --loss_scaling 16384 \
109
+ --weight_decay 0.01 \
110
+ --warmup_ratio 0.15 \
111
+ --groupbert_schedule \
112
+ --config_overrides "hidden_dropout_prob=0.0,attention_probs_dropout_prob=0.0" \
113
+ --ipu_config_overrides "device_iterations=1,matmul_proportion=0.22,layers_per_ipu=[1 3 4 4]" \
114
+ --output_dir output-pretrain-groupbert-base-phase1
115
+ ```
116
+
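The Phase 1 command sets `--learning_rate 0.012`, `--warmup_ratio 0.15` and `--max_steps 10500`. Assuming a plain linear warmup followed by linear decay to zero (the `--groupbert_schedule` flag may alter this shape, which is not covered here), the learning rate at a given step can be sketched as follows. The helper names are hypothetical.

```python
import math


def warmup_steps(total_steps, warmup_ratio):
    """Number of warmup steps implied by a warmup ratio."""
    return math.ceil(total_steps * warmup_ratio)


def linear_lr(step, peak_lr, total_steps, warmup):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / max(1, total_steps - warmup))


# Phase 1 values from the command above.
total = 10500
warmup = warmup_steps(total, 0.15)
for step in (0, warmup // 2, warmup, total // 2, total):
    print(step, linear_lr(step, 0.012, total, warmup))
```

Under these assumptions the peak learning rate is reached after 1575 optimizer steps, that is, 15% of the 10500 steps.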
117
+ Phase 2:
118
+ ```
119
+ python examples/language-modeling/run_pretraining.py \
120
+ --model_type groupbert \
121
+ --tokenizer_name bert-base-uncased \
122
+ --ipu_config_name Graphcore/bert-base-ipu \
123
+ --dataset_name Graphcore/wikipedia-bert-512 \
124
+ --model_name_or_path ./output-pretrain-groupbert-base-phase1 \
125
+ --do_train \
126
+ --logging_steps 5 \
127
+ --max_seq_length 512 \
128
+ --max_steps 2038 \
129
+ --is_already_preprocessed \
130
+ --dataloader_num_workers 128 \
131
+ --dataloader_mode async_rebatched \
132
+ --lamb \
133
+ --per_device_train_batch_size 2 \
134
+ --gradient_accumulation_steps 2048 \
135
+ --pod_type pod16 \
136
+ --learning_rate 0.01 \
137
+ --loss_scaling 128 \
138
+ --weight_decay 0.01 \
139
+ --warmup_ratio 0.15 \
140
+ --groupbert_schedule \
141
+ --config_overrides "hidden_dropout_prob=0.0,attention_probs_dropout_prob=0.0" \
142
+ --ipu_config_overrides "device_iterations=1,embedding_serialization_factor=2,matmul_proportion=0.22,layers_per_ipu=[1 3 4 4]" \
143
+ --output_dir output-pretrain-groupbert-base-phase2
144
+ ```
145
+
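The `--ipu_config_overrides` value is a comma-separated list of `key=value` pairs, where `layers_per_ipu=[1 3 4 4]` gives the number of encoder layers placed on each IPU in the pipeline. The parser below is a hypothetical stand-in written for illustration; it is not how `optimum-graphcore` reads the flag. The check against 12 layers assumes GroupBERT-Base has 12 encoder layers, as BERT-Base does.

```python
def parse_overrides(spec):
    """Parse 'a=1,b=0.5,c=[1 2 3]' into a dict of ints, floats and int lists."""
    result = {}
    for item in spec.split(","):
        key, _, raw = item.partition("=")
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            value = [int(tok) for tok in raw[1:-1].split()]
        else:
            try:
                value = int(raw)
            except ValueError:
                value = float(raw)
        result[key.strip()] = value
    return result


# Phase 2 override string from the command above.
overrides = parse_overrides(
    "device_iterations=1,embedding_serialization_factor=2,"
    "matmul_proportion=0.22,layers_per_ipu=[1 3 4 4]"
)
layers = overrides["layers_per_ipu"]
print(len(layers), "IPUs per replica,", sum(layers), "encoder layers in total")
```

The first IPU gets only one encoder layer, presumably because it also holds the embeddings, although the model card does not state this.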
146
  ### Training hyperparameters
147
 
148
+ The following hyperparameters were used during phase 1 training:
149
+ - learning_rate: 0.012
150
+ - train_batch_size: 8
151
+ - eval_batch_size: 1
152
+ - seed: 42
153
+ - distributed_type: IPU
154
+ - gradient_accumulation_steps: 2000
155
+ - total_train_batch_size: 64000
156
+ - total_eval_batch_size: 20
157
+ - optimizer: LAMB
158
+ - lr_scheduler_type: linear
159
+ - lr_scheduler_warmup_ratio: 0.15
160
+ - training_steps: 10500
161
+ - training precision: Mixed Precision
162
+
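The total train batch size can be cross-checked against the Phase 1 command line, which passes `--per_device_train_batch_size 8` and `--gradient_accumulation_steps 2000`. Assuming the total is the product of per-device batch size, gradient accumulation steps and the number of data-parallel replicas, the listed total of 64000 implies 4 replicas. That figure is an inference, not something the model card states; it would be consistent with an IPU-POD16 running 4-IPU pipelines. The helper names are hypothetical.

```python
def total_train_batch_size(per_device, grad_accum, replicas):
    """Samples consumed per optimizer step across all replicas."""
    return per_device * grad_accum * replicas


def implied_replicas(total, per_device, grad_accum):
    """Replica count that makes the product match the reported total."""
    replicas, remainder = divmod(total, per_device * grad_accum)
    if remainder:
        raise ValueError("total is not a multiple of per_device * grad_accum")
    return replicas


# Phase 1: values from the command line and the reported total.
print(implied_replicas(64000, per_device=8, grad_accum=2000))
print(total_train_batch_size(per_device=8, grad_accum=2000, replicas=4))
```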
163
+ The following hyperparameters were used during phase 2 training:
164
  - learning_rate: 0.01
165
  - train_batch_size: 2
166
  - eval_batch_size: 1
 
184
  - Transformers 4.20.1
185
  - Pytorch 1.10.0+cpu
186
  - Datasets 2.6.1
187
+ - Tokenizers 0.12.1