Correct use of dataset.class_encode_column

david-waterworth · July 15, 2023, 1:48am

I have a dataset consisting of two fields (“text” and “label”) and two splits (“train” and “test”). Both text and label are of type string.

I need to encode the labels. I have a large number of classes and need to discover them at train time. The following code works fine:

from itertools import chain
from sklearn.preprocessing import LabelEncoder

# Fit a single encoder on the labels of both splits so ids are shared
label_encoder = LabelEncoder()
labels = list(chain(dataset['train']['label'], dataset['test']['label']))
label_encoder.fit(labels)
num_classes = len(label_encoder.classes_)
id2label = {id: label 
            for id, label in enumerate(label_encoder.classes_.tolist())}
label2id = {label: id 
            for (id, label) in id2label.items()}

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    tokens['label'] = [label2id[label] for label in batch['label']]

    return tokens

I’m aware that datasets has built-in functionality for “categorical” columns, but I cannot get it to work. I have a feeling that it’s assigning different ids to the same label in the test and train sets. My code is:

dataset = dataset.class_encode_column("label")
num_classes = dataset['train'].features['label'].num_classes
id2label = {id:dataset['train'].features['label'].int2str(id) for id in range(num_classes)}
label2id = {label:id for (id,label) in id2label.items()}

With this code I removed the label encoding from my tokenize function, i.e.

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    return tokens

In both cases I initialise the model as follows:

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes, id2label=id2label, label2id=label2id
)

In the first case I get 73% accuracy for Epoch 1 (evaluated on the test set). When I use class_encode_column I get 2%. If I train and evaluate using only the test set I get much higher accuracy. This implies to me that either it’s encoding both test and train independently or I’m doing something else wrong?
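To illustrate the suspicion above, here is a plain-Python sketch of what independent per-split encoding would look like. It assumes each split gets its ids from the sorted unique labels found in that split alone; that is an assumption about the behaviour, not something confirmed in this thread. The helper `encode_independently` and the toy labels are hypothetical.

```python
def encode_independently(labels):
    """Map labels to ids using only the labels present in this list."""
    names = sorted(set(labels))
    return {name: i for i, name in enumerate(names)}


# Hypothetical toy labels: "cat" never appears in the test split.
train_labels = ["cat", "dog", "emu", "dog"]
test_labels = ["dog", "emu"]

train_map = encode_independently(train_labels)
test_map = encode_independently(test_labels)

# Labels whose id differs between the two splits.
mismatched = {name for name in test_map if test_map[name] != train_map[name]}

print(train_map)
print(test_map)
print(mismatched)
```

If the splits were encoded this way, a model trained on the train ids would be scored against shifted test ids, which would be consistent with the very low accuracy described above.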

Can you please advise how to properly create a ClassLabel for a dataset that is already split? And when training a model with a ClassLabel as the target, are there any additional considerations?

lhoestq · July 17, 2023, 1:22pm

Hi! For all the splits to have the same labels, you can do:

dataset["train"] = dataset["train"].class_encode_column("label")
class_label_feature = dataset["train"].features["label"]

dataset["test"] = dataset["test"].cast_column("label", class_label_feature)
