Hello,
I am having trouble with the ClassLabel feature for token classification. My dataset lives in a pandas DataFrame, which I load into a Dataset, but I cannot see my 9 custom IOB labels inside a ClassLabel feature.
import pandas as pd
from datasets import Dataset

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.train_test_split(test_size=0.1)
Output:
DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels', 'id'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'labels', 'id'],
        num_rows: 1000
    })
})
Output:
{'tokens': Value(dtype='string', id=None),
'labels': Value(dtype='string', id=None),
'id': Value(dtype='int64', id=None)}
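What I would like to end up with is something like this instead (a sketch; the example tag names stand in for my 9 IOB labels):

'labels': Sequence(ClassLabel(names=['O', 'B-DRUG', 'I-DRUG', ...]))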
I already tried the "cast" method → dataset.cast_column("labels", …)
here: ClassLabel Error · Issue #5737 · huggingface/datasets · GitHub
and the "new_features" approach from the package reference,
here: Main classes
@mariosasko ?
Thank you guys!
Try this:
df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.class_encode_column("labels")
dataset = dataset.train_test_split(test_size=0.1)
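For context, class_encode_column builds a ClassLabel from the unique values found in a string column and stores each row as an integer class id. Roughly (the example values below are just an illustration):

# before: dataset["labels"] == ["cat", "dog", "cat"]        (plain strings)
# after class_encode_column("labels"):
#   dataset.features["labels"] == ClassLabel(names=["cat", "dog"])
#   dataset["labels"] == [0, 1, 0]                          (integer class ids)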
Thank you for your help @mariosasko
It's correctly creating a ClassLabel(names=…).
Unfortunately, it's appending all the labels of each row into a single entry…
The column df["labels"] holds one comma-joined string per row, e.g. ['O,O,B-DRUG,O,B-HOSPITAL,O'].
With class_encode_column, my result is a very long ClassLabel(names=['O,O,B-DRUG,O,B-HOSPITAL,O,B-HOSPITAL,O,B-DATE,O,O,O', …]).
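In other words, a rough sketch of what seems to be happening (made-up rows):

# df["labels"][0] == "O,O,B-DRUG,O,B-HOSPITAL,O"   -> treated as one single class value
# df["labels"][1] == "O,B-DATE,O"                  -> another single class value
# so the resulting ClassLabel names are whole tag sequences instead of individual tags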
Maybe the labels shouldn’t be a Pandas Series?
Oh, I missed the “Token Classification” part.
Then this should work:
import datasets
import pandas as pd
from datasets import Dataset

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)

# split the comma-joined label string of each row into a list of tags
dataset = dataset.map(lambda ex: {"labels": ex["labels"].split(",")})

def get_label_list(labels):
    # copied from https://github.com/huggingface/transformers/blob/66fd3a8d626a32989f4569260db32785c6cbf42a/examples/pytorch/token-classification/run_ner.py#L320
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

all_labels = get_label_list(dataset["labels"])
dataset = dataset.cast_column("labels", datasets.Sequence(datasets.ClassLabel(names=all_labels)))
dataset = dataset.train_test_split(test_size=0.1)
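If it helps to sanity-check, after the cast the column should look roughly like this (a sketch; the exact names and ids depend on your data):

print(dataset["train"].features["labels"].feature.names)
# e.g. ['B-DATE', 'B-DRUG', 'B-HOSPITAL', 'O', ...]
print(dataset["train"][0]["labels"])
# e.g. [3, 3, 1, 3, 2, 3]   <- one integer class id per token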
Perfect! Thank you @mariosasko
Have a good evening!
Sorry to bother you again @mariosasko
I am trying to create a “ner_tags” column with integers corresponding to “labels” to follow the HF tutorial on Token Classification.
I tried that…
tags = dataset.features[f"labels"].feature.names
print(tags)

def create_tag_names(batch):
    return {"ner_tags_str": [tags.str2int(idx) for idx in batch["labels"]]}

dataset = dataset.map(create_tag_names)
Any ideas? Thx!
I am trying to create a “ner_tags” column with integers corresponding to “labels” to follow the HF tutorial on Token Classification.
Can you provide a link to the tutorial? We already store class labels as integers, so I’m not sure I understand what you want to do.
Oops, sorry!
The tutorial works with a ner_tags column made of integers that map to the corresponding labels…
Thx!
You can rename the labels column to ner_tags to have the same structure:
dataset = dataset.rename_column("labels", "ner_tags")
And apply the rest of the processing.
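If you also want the string tags back (the create_tag_names step from the tutorial), here is a minimal sketch, assuming dataset is still a single (un-split) Dataset with the Sequence(ClassLabel) column from above; on a DatasetDict use dataset["train"].features instead:

tags = dataset.features["ner_tags"].feature  # the ClassLabel feature, not its .names list

def create_tag_names(batch):
    # int2str maps integer class ids back to their tag strings
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

dataset = dataset.map(create_tag_names)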
@mariosasko Thank you for your help!
I found the solution to my stupid bug with the "tokens" column: I turned each sentence string into a list of tokens by adding the following line:
dataset = dataset.map(lambda ex: {"tokens": ex["tokens"].split(",")})
Thx again! : )