Saving train/val/test datasets

Hi everyone.

After creating a dataset consisting of all my data, I split it into train/validation/test sets. I then run a number of preprocessing steps on each of them and end up with three altered datasets, each of type datasets.arrow_dataset.Dataset.

To save them so that I can load the preprocessed datasets directly in the future, would I have to call

dataset.save_to_disk(FILE_PATH)

three times, once for the training set, once for the validation set and once for the test set? Or is there a way to save them all together? If so, which is more efficient?

Thanks in advance.

Hi!

You can save them all as a dataset dictionary:

from datasets import DatasetDict, load_from_disk

dataset = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset,
})

dataset.save_to_disk("path/to/dataset/dir")

# reload
dataset = load_from_disk("path/to/dataset/dir")

# access any split
train_dataset = dataset["train"]

This keeps all the splits of a dataset together under one directory, so a single load_from_disk call restores every split.

This is exactly what I was looking for!

Thanks a lot :)