I am playing with LLMs and have the usual two columns, "input_ids" and "labels", in a Hugging Face Dataset object. The dataset was created from a pandas DataFrame:
from datasets import Dataset

dataset = Dataset.from_pandas(df)
Then I encode two columns of this dataframe with a tokenizer.
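Roughly, the encoding step looks like this (the checkpoint and the text column names are just placeholders, not my exact code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint

def encode(batch):
    # "source" and "target" are placeholder names for the two text columns
    model_inputs = tokenizer(batch["source"])
    targets = tokenizer(batch["target"])
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs

dataset = dataset.map(encode, batched=True)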
I would like to compute the maximum length of the encodings.
First way:
# Iterate over each column; accessing dataset["input_ids"] materializes it as a list of lists
max_source_len = max(len(x) for x in dataset["input_ids"])
max_target_len = max(len(x) for x in dataset["labels"])
print(max_source_len)
print(max_target_len)
This takes approximately 17 seconds.
I have also tested variations of the previous code, but the bottleneck seems to be the conversion of the columns to Python list objects. I have also tried to force the dataset to stay in memory, but, frankly, I did not fully understand the documentation on this (it seems to be possible only when a dataset is loaded with load_dataset).
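For reference, what I found in the docs looks roughly like this (the dataset name is just a placeholder), and it does not seem to apply to a dataset built with Dataset.from_pandas:

from datasets import load_dataset

# keep_in_memory appears to be available here, but not for Dataset.from_pandas
ds = load_dataset("some_dataset", keep_in_memory=True)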
If I transform the dataset back into a pandas DataFrame:
df2 = dataset.to_pandas()
print(df2.input_ids.apply(len).max())
print(df2.labels.apply(len).max())
the two values are printed in less than 2 seconds.
How am I supposed to apply operations to the columns of a Dataset efficiently?