Hello,
For some experiments, I have already split my local private data into "train", "test", and "valid" and saved them in a single .json file with the following structure:
{
  "train": [
    {
      "text": "this is the first example",
      "label": 2
    },
    {
      "text": "this is the second example",
      "label": 1
    }
  ],
  "test": …,
  "valid": …
}
but I am struggling to create a Dataset out of it. This tutorial (the in-memory section):
https://huggingface.co/docs/datasets/loading_datasets.html#from-a-python-dictionary
doesn't make it clear how to proceed. I also looked at the source code of the `from_dict` method in arrow_dataset.py, but I couldn't work out a solution to my problem from it.
Does anyone have an idea (other than splitting the original file into several files and reformatting them)?