File names and splits
These 8 datasets showcase the variety of split configurations supported on the Hugging Face Hub. See the docs: https://huggingface.co/docs/hub/datasets-file-names-and-splits.
Note: Basic use case. If your dataset isn’t split into train/validation/test splits, the simplest dataset structure is to have one file: data.csv
datasets-examples/doc-splits-2
Note: You can name your data files after the train, test, and validation splits.
datasets-examples/doc-splits-3
Note: If you don’t have any non-traditional splits, then you can place the split name anywhere in the data file name. The only rule is that the split name must be delimited by non-word characters: test-file.csv works, for example, but testfile.csv does not. Supported delimiters include underscores, dashes, spaces, dots, and numbers.
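The delimiter rule can be sketched with a regular expression. This is only an illustration under stated assumptions: `find_split` and `DELIMS` are hypothetical names, and the pattern is not the Hub's actual matching code.

```python
import re

# Assumed delimiter set, taken from the note: underscores, dashes,
# spaces, dots, and digits. Hypothetical, not the Hub's real pattern.
DELIMS = r"[-._ 0-9]"


def find_split(filename, splits=("train", "validation", "test")):
    """Return the first split name found as a delimited token, else None."""
    name = filename.rsplit("/", 1)[-1]  # ignore any directory part
    for split in splits:
        # The split name must sit at the start/end of the name
        # or be surrounded by delimiter characters.
        pattern = rf"(?:^|{DELIMS}){split}(?:{DELIMS}|$)"
        if re.search(pattern, name):
            return split
    return None


print(find_split("test-file.csv"))  # delimited by a dash
print(find_split("testfile.csv"))   # not delimited, so no split is found
```

With this sketch, test-file.csv is recognized as a test file, while testfile.csv is not assigned to any split.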
datasets-examples/doc-splits-4
Note: You can place your data files into different directories named train, test, and validation, where each directory contains the data files for that split.
datasets-examples/doc-splits-5
Note: There are several ways to refer to train/validation/test splits. Validation splits are sometimes called “dev”, and test splits may be referred to as “eval”. These other split names are also supported, and the following keywords are equivalent:
- train, training
- validation, valid, val, dev
- test, testing, eval, evaluation
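The keyword equivalences above can be written down as a small lookup table. This is a minimal sketch: `SPLIT_KEYWORDS` and `canonical_split` are hypothetical names used for illustration, not part of any Hugging Face library.

```python
# Equivalent keywords, copied from the note above.
SPLIT_KEYWORDS = {
    "train": ("train", "training"),
    "validation": ("validation", "valid", "val", "dev"),
    "test": ("test", "testing", "eval", "evaluation"),
}

# Invert the table so each keyword points at its canonical split name.
KEYWORD_TO_SPLIT = {
    keyword: split
    for split, keywords in SPLIT_KEYWORDS.items()
    for keyword in keywords
}


def canonical_split(keyword):
    """Map a keyword such as 'dev' to its canonical split, or None if unknown."""
    return KEYWORD_TO_SPLIT.get(keyword)


print(canonical_split("dev"))
print(canonical_split("eval"))
```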
datasets-examples/doc-splits-6
Note: Splits can span several files. Make sure all the files of your train set have train in their names (same for test and validation). You can even add a prefix or suffix to train in the file name (like my_train_file_00001.csv for example).
datasets-examples/doc-splits-7
Note: For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
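As a rough illustration of directory-based inference, the sketch below looks for a parent directory whose name is a split name. The function `split_from_directory`, the `KNOWN_SPLITS` set, and the example paths are assumptions made for this example; the Hub's real resolution logic may differ.

```python
from pathlib import PurePosixPath

# Assumed set of recognized directory names (hypothetical, for illustration).
KNOWN_SPLITS = {"train", "test", "validation"}


def split_from_directory(path):
    """Return the first parent directory that is a known split name, else None."""
    for part in PurePosixPath(path).parent.parts:
        if part in KNOWN_SPLITS:
            return part
    return None


print(split_from_directory("train/shard_0.csv"))
print(split_from_directory("data/validation/part_a.csv"))
```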
datasets-examples/doc-splits-8
Note: If your dataset splits have custom names that aren’t train, test, or validation, then you can name your data files like data/<split_name>-xxxxx-of-xxxxx.csv.
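To make the custom-split naming scheme concrete, here is a hedged sketch that parses such a path. It assumes that the file name starts with the split name and that each xxxxx stands for a five-digit number (shard index, then shard count). `SHARD_PATTERN`, `parse_custom_split`, and the split name "holdout" are hypothetical and used only for illustration.

```python
import re

# Assumption: <split>-<5 digits>-of-<5 digits>.csv inside a data/ directory.
SHARD_PATTERN = re.compile(
    r"^data/(?P<split>\w+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.csv$"
)


def parse_custom_split(path):
    """Return (split, shard_index, shard_count), or None if the path does not match."""
    match = SHARD_PATTERN.match(path)
    if match is None:
        return None
    return match["split"], int(match["index"]), int(match["total"])


print(parse_custom_split("data/holdout-00000-of-00002.csv"))
```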