I have a dataset consisting of two fields (“text” and “label”) and two splits (“train” and “test”). Both text and label are of type string.
I need to encode the labels. I have a large number of classes and I need to discover them at train time. The following code works fine:
from itertools import chain
from sklearn.preprocessing import LabelEncoder  # assuming scikit-learn's LabelEncoder

label_encoder = LabelEncoder()
# Fit on the labels from both splits so every class gets exactly one id
labels = list(chain(dataset['train']['label'], dataset['test']['label']))
label_encoder.fit(labels)
num_classes = len(label_encoder.classes_)
id2label = {id: label
            for id, label in enumerate(label_encoder.classes_.tolist())}
label2id = {label: id
            for (id, label) in id2label.items()}

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    # Map each string label to its integer id
    tokens['label'] = [label2id[label] for label in batch['label']]
    return tokens
I’m aware that datasets has built-in functionality for “categorical” columns but I cannot get it to work. I have a feeling that it’s assigning different ids to the same label in the test and train sets, i.e. my code is
dataset = dataset.class_encode_column("label")
num_classes = dataset['train'].features['label'].num_classes
id2label = {id:dataset['train'].features['label'].int2str(id) for id in range(num_classes)}
label2id = {label:id for (id,label) in id2label.items()}
With this code I removed the label encoding from my tokenize function, i.e.
def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    return tokens
In both cases I initialise the model as follows:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes, id2label=id2label, label2id=label2id
)
In the first case I get 73% accuracy after epoch 1 (evaluated on the test set). When I use class_encode_column I get 2%. If I train and evaluate using only the test set I get much higher accuracy. This suggests to me that either it’s encoding the train and test splits independently or I’m doing something else wrong.
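To make that suspicion concrete, here is a toy sketch in plain Python of the failure mode I have in mind. It does not use datasets at all and is not a claim about what class_encode_column actually does; the helper `build_label2id` and the sample labels are made up for illustration. It only shows that fitting a mapping per split can give the same label different ids when one split is missing a class.

```python
def build_label2id(labels):
    # Hypothetical helper: assign ids in sorted order of the unique labels
    return {label: i for i, label in enumerate(sorted(set(labels)))}

# Made-up labels; the test split is missing the "cat" class
train_labels = ["cat", "dog", "emu", "dog"]
test_labels = ["dog", "emu"]

# Mapping fitted separately on each split
independent_train = build_label2id(train_labels)
independent_test = build_label2id(test_labels)

# Mapping fitted once on the union of both splits
shared = build_label2id(train_labels + test_labels)

# Labels whose id differs between the two independently fitted mappings
mismatched = sorted(
    label for label in independent_test
    if independent_test[label] != independent_train[label]
)

print("train mapping:", independent_train)
print("test mapping: ", independent_test)
print("shared mapping:", shared)
print("labels with conflicting ids:", mismatched)
```

If something like this were happening, the model would be trained on one set of ids and scored against another, which would be consistent with near-chance accuracy on the test split.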
Can you please advise how to properly create a ClassLabel for a dataset that is already split? And when training a model with a ClassLabel as the target, are there any additional considerations?