Dataset.from_dict() killed

I am trying to convert a dataset into Arrow format following this tutorial here, but the process gets killed (the console literally says KILLED). The dataset I am using is the FACAD dataset, which contains fashion images and tokenized captions and is provided in HDF5 format.

This is the code that I use for converting:

import json
import os
import h5py
import pandas as pd
from datasets import Dataset, DatasetDict

Dataset.cleanup_cache_files  # note: this is a no-op; cleanup_cache_files() is an instance method and is never actually called here

data_folder = "/data"  # placeholder, not in the original snippet: folder containing the HDF5/JSON files
OUTFOLDER = "/data_processed/fic"
splits = ["train"]
split_dict = {}

for split in splits:
    split = split.upper()
    # h5py lazily maps the image array; nothing is read into RAM yet
    h = h5py.File(os.path.join(data_folder, split + '_IMAGES' + '.hdf5'), 'r')
    images = h['images']

    with open(os.path.join(data_folder, split + '_CAPTIONS_RAW' + '.json'), 'r') as j:
        texts = json.load(j)
    
    # use the row index as a stand-in filename
    filenames = list(range(len(images)))

    # from_dict appears to materialize all columns (including every image) in memory
    ds = Dataset.from_dict({"image": images, "text": texts, "filename": filenames})
    split = split.lower()
    split_dict[split] = ds
    
dataset = DatasetDict(split_dict)
dataset.save_to_disk(OUTFOLDER)

Previously, I converted the captions into raw strings and saved them to a {split}_CAPTIONS_RAW.json file. The images are provided as RGB, 256x256 pixels (not normalized).
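For reference, this is roughly what the captions file contains and how I sanity-check that it lines up with the images (the example captions and paths are just illustrative, not the real data):

import json
import h5py

# the JSON file is a flat list of caption strings, one per image, e.g.
# ["a red floral summer dress", "black leather ankle boots", ...]
with open("TRAIN_CAPTIONS_RAW.json", "r") as j:
    texts = json.load(j)

with h5py.File("TRAIN_IMAGES.hdf5", "r") as h:
    assert len(texts) == len(h["images"]), "captions and images are not aligned"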

I assume it has to do with the size of the train split (888,293 samples), because for the validation and test splits I was able to create the Arrow files without problems.
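In case it is relevant, this is the kind of generator-based workaround I have been considering, since, as far as I understand, Dataset.from_generator writes samples to the cache incrementally instead of building the whole table in memory. This is only a minimal sketch with placeholder paths that I have not yet tried on the full train split:

import json
import os
import h5py
from datasets import Dataset

def gen_samples(h5_path, json_path):
    # yield one sample at a time so the full image array never has to sit in RAM
    with open(json_path, 'r') as j:
        texts = json.load(j)
    with h5py.File(h5_path, 'r') as h:
        images = h['images']
        for i in range(len(images)):
            yield {"image": images[i], "text": texts[i], "filename": i}

train_ds = Dataset.from_generator(
    gen_samples,
    gen_kwargs={
        "h5_path": os.path.join("/data", "TRAIN_IMAGES.hdf5"),          # placeholder path
        "json_path": os.path.join("/data", "TRAIN_CAPTIONS_RAW.json"),  # placeholder path
    },
)
train_ds.save_to_disk("/data_processed/fic")

Would something like this be the recommended way to handle a split of this size, or is there a better option?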

Any help is appreciated.