So I am trying to train an automated essay scoring system that combines the loss for predicting essay scores with the loss for predicting whether each sentence is grammatically correct. To do this I have wrapped each sentence in the essay with its own [CLS] and [SEP] token, so that an essay is fed into BERT like this:
essay 1 : [cls] … sent 1 … [sep][cls] … sent 2 … [sep][cls] … sent 3 … [sep][cls] … etc
essay 2 : [cls] … sent 1 … [sep][cls] … sent 2 … [sep][cls] … sent 3 … [sep][cls] … etc
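For context, this is roughly how I build that input. The helper below is only a sketch (sentence splitting, truncation and the function name are my own simplifications):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

def encode_essay(sentences, max_length=512):
    input_ids = []
    for sent in sentences:
        # each sentence gets its own special tokens, i.e. <s> ... </s>
        # (the roberta equivalents of [CLS] ... [SEP])
        input_ids += tokenizer(sent, add_special_tokens=True)['input_ids']
    input_ids = input_ids[:max_length]
    attention_mask = [1] * len(input_ids)
    # pad up to max_length (roberta's pad token id is 1)
    pad_len = max_length - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad_len
    attention_mask += [0] * pad_len
    return {'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask)}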
Alongside this I have a list of labels for each sentence (whether or not it contains a grammatical error) and a score for the essay, i.e.
essay 1 : labels: [1,0,0,1,etc…],score:38
essay 2 : labels: [1,1,0,1,etc…],score:24
So that the list of labels can be passed into a dataset, it has to have the same length as the input_ids, attention_mask, etc., so I have padded it with 2s at every token that is not a [CLS] token. The labels therefore actually look something like this:
[1,2,2,0,2,2,2,0,2,2,2,2,1,etc…]
which would correspond to a token sequence like
[cls,w,w,cls,w,w,w,cls,w,w,w,w,cls…]
(w=word)
I then use this to get the indices of the [CLS] tokens in each essay.
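For reference, the index extraction itself is just something like this (using the toy label list above; the variable names are mine):

import torch

labels = torch.tensor([1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 1])
# every position with a label < 2 is a [CLS] position, 2 marks all other tokens
cls_index = torch.nonzero(labels < 2, as_tuple=True)[0]   # tensor([ 0,  3,  7, 12])
cls_labels = labels[cls_index]                            # tensor([1, 0, 0, 1])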
The performance of this setup on grammatical error detection is comparable to feeding each sentence into BERT normally:
sent 1 : [cls] … sent 1 … [sep]
sent 2 : [cls] … sent 2 … [sep]
However, unsurprisingly, the additional [SEP] and [CLS] tokens decrease the model's performance compared to feeding an essay into BERT normally:
essay 1 : [cls] … essay 1 … [sep]
essay 2 : [cls] … essay 2 … [sep]
To combat this, I am trying to use the output vector representation of each [CLS] token in each essay as input to another, smaller transformer model, as done in this paper: https://arxiv.org/pdf/1903.10318.pdf. However, I cannot figure out how to do this. I have tried feeding the output into an embedding layer with inputs_embeds=True, but this does not work.
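Conceptually, what I am trying to end up with is something like the sketch below. The small encoder here is just a stand-in nn.TransformerEncoder rather than the exact setup from the paper, and all the names and heads are my own:

import torch
import torch.nn as nn
from transformers import AutoModel

class EssayScorer(nn.Module):
    def __init__(self, hidden_size=768, n_layers=2, n_heads=8):
        super().__init__()
        self.bert = AutoModel.from_pretrained('distilroberta-base')
        # small transformer over the per-sentence [CLS] vectors
        sent_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers=n_layers)
        self.grammar_head = nn.Linear(hidden_size, 2)  # per-sentence error / no error
        self.score_head = nn.Linear(hidden_size, 1)    # essay score

    def forward(self, input_ids, attention_mask, cls_mask):
        # cls_mask: [bs, seq_len], 1 at [CLS] positions, 0 everywhere else
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # pull out only the [CLS] vectors; this assumes every essay in the batch
        # has the same number of sentences, otherwise they would need padding
        n_sents = int(cls_mask[0].sum())
        cls_vecs = hidden[cls_mask.bool()].view(hidden.size(0), n_sents, -1)
        sent_repr = self.sent_encoder(cls_vecs)                # [bs, n_sents, hidden]
        grammar_logits = self.grammar_head(sent_repr)          # [bs, n_sents, 2]
        score = self.score_head(sent_repr.mean(dim=1)).squeeze(-1)  # [bs]
        return grammar_logits, score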
Below is a simplified version of my code so far, using just a mini batch as a test:
# encoded_dataset_train is my training dataset type = (datasets.arrow_dataset.Dataset)
mini_batch = encoded_dataset_train[:2]
print(mini_batch)
'''{'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]]),
'input_ids': tensor([[ 0, 23314, 5348, ..., 1, 1, 1],
[ 0, 1360, 73, ..., 1, 1, 1]]),
'label': tensor([[1, 2, 2, ..., 2, 2, 2],
[1, 2, 2, ..., 2, 2, 2]]),
'score': tensor([31, 23])}'''
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained('distilroberta-base')
# get the hidden states from the final layer of the model
input_ids = mini_batch['input_ids']
attention_mask = mini_batch['attention_mask']
output = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
print(output.shape)
'''
torch.Size([2, 512, 768])
'''
# get a mask where each cls token in each batch is 1 and 0 for all other tokens
bs, tok_len = mini_batch['label'].shape
active_labels = torch.where(mini_batch['label'] < 2, 1, 0).reshape(bs, tok_len, 1).expand(bs, tok_len, 768)
print(active_labels.shape)
'''
torch.Size([2, 512, 768])
'''
# multiply the output by the mask
active_loss = output * active_labels
print(active_loss.shape)
'''
torch.Size([2, 512, 768])
'''
# Create my smaller model (so far I have only extracted the layers, I have not combined them yet)
embeds = model.embeddings
layers = model.encoder.layer[:2]
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # the base AutoModel has no classifier head, so I make my own
# Result for trying to pass the output into the embedding layer
embeds.forward(input_ids=active_loss,inputs_embeds=True)
'''
RuntimeError: The size of tensor a (512) must match the size of tensor b (768) at non-singleton dimension 2
'''
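My guess is that the embedding layer is still expecting token ids rather than 768-dimensional vectors, so presumably I should be passing the vectors straight into the extracted encoder layers instead, something like the sketch below (just my guess at the right call, attention-mask handling omitted):

# push the (already-embedded) masked vectors straight through the extracted layers
hidden = active_loss                  # [2, 512, 768]
for layer in layers:
    hidden = layer(hidden)[0]         # each layer returns a tuple, [0] is the hidden states
print(hidden.shape)                   # hopefully torch.Size([2, 512, 768])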
Cheers in advance