Hello everyone, I hope this is the correct category for this question. I'm using TFAutoModelForSequenceClassification for a multi-label classification task. My dataset has a text column and 20 label columns, one per class; a 1 in a column means the example belongs to that class. For example:
'It's cold today' 0 0 1 1 0 1 0
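For reference, here is a minimal sketch of what train_df looks like (the column names 'Premise' for the text and '0' through '19' for the labels are the ones my code below assumes; the actual data is of course different):

import numpy as np
import pandas as pd

# Toy stand-in for the real training DataFrame: one text column
# plus 20 binary label columns named '0' through '19'.
train_df = pd.DataFrame({'Premise': ["It's cold today", 'Nice weather outside']})
for i in range(20):
    train_df[str(i)] = np.random.randint(0, 2, size=len(train_df))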
I loaded the DataFrame into a HF Dataset and loaded the model and tokenizer with:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, create_optimizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = TFAutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_labels,
    problem_type='multi_label_classification',
    id2label=id2labels,
    label2id=labels2id,
)
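Here num_labels, id2labels and labels2id are the usual size and id↔label mappings built over the 20 classes, along these lines (the class names below are placeholders, not my real ones):

num_labels = 20
id2labels = {i: f'class_{i}' for i in range(num_labels)}  # placeholder names
labels2id = {label: i for i, label in id2labels.items()}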
I then converted each example of the dataset into the form:
'Tokenized text' | 'Array of 0s and 1s'
To do this, I wrote the following function:
def tokenize_and_encode(val, tokenizer, max_length):
    tokenized = tokenizer(val['Premise'], truncation=True, padding='max_length', max_length=max_length)
    labels = []
    for index in id2labels.keys():
        # Collect the per-class columns into a single array of zeros and ones
        labels.append(val[str(index)])
    return {'input_ids': tokenized['input_ids'],
            'attention_mask': tokenized['attention_mask'],
            'labels': labels}
train_dataset = Dataset.from_pandas(train_df)
train_dataset = train_dataset.map(
    lambda x: tokenize_and_encode(x, tokenizer, 200),
    remove_columns=train_dataset.column_names
)
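As a quick sanity check on the mapped dataset, one example should carry one label entry per class:

print(len(train_dataset[0]['labels']))  # expect 20
print(train_dataset[0]['labels'])       # e.g. [0, 1, 0, 1, ...]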
I then prepared the dataset and the model for the training phase:
batch_size = 16
num_epochs = 3
batches_per_epoch = len(train_dataset) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
tf_train_set = bert.prepare_tf_dataset(
    train_dataset,
    shuffle=True,
    batch_size=batch_size
)
bert.compile(optimizer=optimizer)
Up to this point everything runs without problems. But when I call:
history = bert.fit(tf_train_set, epochs=5)
I get a very long error; I think the most important part is:
ValueError: `labels.shape` must equal `logits.shape` except for the last dimension. Received: labels.shape=(320,) and logits.shape=(16, 20)
Call arguments received by layer "tf_bert_for_sequence_classification" (type TFBertForSequenceClassification):
• self={'input_ids': 'tf.Tensor(shape=(16, 200), dtype=int64)', 'attention_mask': 'tf.Tensor(shape=(16, 200), dtype=int64)', 'labels': 'tf.Tensor(shape=(16, 20), dtype=int64)'}
• input_ids=None
• attention_mask=None
• token_type_ids=None
• position_ids=None
• head_mask=None
• inputs_embeds=None
• output_attentions=None
• output_hidden_states=None
• return_dict=None
• labels=None
• training=True
The error says the labels arrived with shape (320,) while the logits have shape (16, 20), yet the call arguments right below show that I did pass labels of shape (16, 20). Since 320 = 16 × 20, it looks like my labels are being flattened somewhere inside the loss computation. I can't understand what's happening.
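In case it helps with debugging, this is a quick way to inspect what the tf.data pipeline actually yields before the batch reaches the model (tf.nest.map_structure handles either a dict or a tuple batch structure):

import tensorflow as tf

# Print the shape of every tensor in one batch from the pipeline.
batch = next(iter(tf_train_set))
print(tf.nest.map_structure(lambda t: t.shape, batch))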
Thank you very much to all of you.