I was previously using the BERT implementation from here:
However, that implementation did not seem to follow the standard BERT procedure exactly when it came to labels/attentions, which made me try out huggingface’s version.
Now I have both of them running on the same data and, as far as I can tell, in an identical way. However, their learning speeds are very different!
With the huggingface implementation, my code reaches about 20% accuracy on masked-token prediction after roughly 7,000 iterations, while the other implementation reaches about 75% accuracy at that point. Their training times also seem comparable.
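(For context, by “accuracy on masked-token prediction” I mean the fraction of masked positions where the argmax over the model’s token logits matches the original token, roughly as in the sketch below. The helper name and the -100 ignore-index convention are just how I’d write it, not code from either repo.)

import torch

def masked_lm_accuracy(logits, labels, ignore_index=-100):
    # logits: (batch, seq_len, vocab_size) token scores from either model
    # labels: (batch, seq_len) original token ids at the masked positions,
    #         ignore_index everywhere else (the usual huggingface convention)
    preds = logits.argmax(dim=-1)
    masked = labels != ignore_index
    return (preds[masked] == labels[masked]).float().mean().item()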
I can see that the two BERT models are slightly different (below I’m printing the networks for 1-layer models, just to get something that is easier to compare):
HuggingfaceBERT has 8097792 parameters
BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30, bias=True)
    )
  )
)
Other BERT has 7136286 trainable parameters.
BERTseq(
  (bert): BERT(
    (embedding): BERTEmbedding(
      (token): TokenEmbedding(30, 768, padding_idx=0)
      (position): PositionalEmbedding()
      (segment): SegmentEmbedding(3, 768, padding_idx=0)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer_blocks): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadedAttention(
          (linear_layers): ModuleList(
            (0): Linear(in_features=768, out_features=768, bias=True)
            (1): Linear(in_features=768, out_features=768, bias=True)
            (2): Linear(in_features=768, out_features=768, bias=True)
          )
          (output_linear): Linear(in_features=768, out_features=768, bias=True)
          (attention): Attention()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=768, out_features=3072, bias=True)
          (w_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation): GELU()
        )
        (input_sublayer): SublayerConnection(
          (norm): LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (output_sublayer): SublayerConnection(
          (norm): LayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (mask_lm): MaskedLanguageModel(
    (linear): Linear(in_features=768, out_features=30, bias=True)
  )
)
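(For reference, the parameter counts above come from summing numel over the trainable parameters, roughly as in the sketch below; hf_model and other_model are just placeholder names for the two networks.)

def count_trainable(model):
    # Sum the element counts of all parameters that require gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"HuggingfaceBERT has {count_trainable(hf_model)} parameters")
print(f"Other BERT has {count_trainable(other_model)} trainable parameters.")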
Does anyone have any idea what could be going on here? I’m not an expert in BERT, so I can’t tell whether any of these differences are crucial.