Load pre-existing in-memory splits into a Dataset

Hello,

For some experiments, I have already divided my local private data into “train”, “test”, and “valid” splits and saved them in a single .json file with the following structure:
{
  "train": [
    {
      "text": "this is the first example",
      "label": 2
    },
    {
      "text": "this is the second example",
      "label": 1
    }
  ],
  "test": …,
  "valid": …
}

but I am struggling to create a Dataset out of it. In this tutorial (the in-memory section):

https://huggingface.co/docs/datasets/loading_datasets.html#from-a-python-dictionary

It isn’t clear how to proceed. I looked at the source code of the from_dict method in arrow_dataset.py, but I couldn’t find a solution to my problem.

Does anyone have an idea (other than splitting the original file into several files and reformatting them)?

Hi,

This is possible, but not nearly as clean as converting the data to JSON Lines with one file per split:

import datasets

ddict = datasets.DatasetDict()
for split in ["train", "test", "valid"]:
    # field=split reads only the records stored under that top-level key
    ddict.update(datasets.load_dataset("json", data_files={split: "path/to/data/file.json"}, field=split))
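
If the file is small enough to read in one go, another option is to load it with the standard json module and build each split in memory with Dataset.from_dict. The following is only a rough sketch: the helper names rows_to_columns and build_dataset_dict are made up for illustration, and it assumes every record in a split has the same keys (otherwise the columns would end up with different lengths).

```python
import json


def rows_to_columns(rows):
    """Turn a list of {"text": ..., "label": ...} records into a dict of columns."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def build_dataset_dict(path):
    """Read the single JSON file and return a DatasetDict with one Dataset per split."""
    # Imported here so that rows_to_columns can be used without `datasets` installed.
    import datasets

    with open(path, encoding="utf-8") as f:
        splits = json.load(f)  # expected shape: {"train": [...], "test": [...], "valid": [...]}

    return datasets.DatasetDict(
        {
            name: datasets.Dataset.from_dict(rows_to_columns(rows))
            for name, rows in splits.items()
        }
    )
```

Calling build_dataset_dict("path/to/data/file.json") should then give a DatasetDict keyed by the split names found in the file. Dataset.from_dict expects a mapping from column name to a list of values, which is why the records are transposed first.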

Thank you very much!