Is there any easy way to stream output to make a data pipeline with Datasets?
The use case is: I’d like to read an HF dataset, run every row through an embedding model, and save the embeddings to disk as a HF dataset.
I wrap the HF Dataset in a Torch DataLoader so I can buffer it with multiple workers and avoid starving the GPU of data. Then there’s a Torch inference loop that runs minibatches through the Sentence Transformers model on the GPU.
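To make the setup concrete, here’s a minimal sketch of that loop. The corpus and the `embed` function are placeholders standing in for the real HF dataset and `model.encode(...)` from Sentence Transformers, so this runs anywhere:

```python
import torch
from torch.utils.data import DataLoader

# Placeholder corpus; in practice this is the HF Dataset.
rows = [{"text": f"document {i}"} for i in range(10)]

def embed(texts):
    # Placeholder for SentenceTransformer.encode(texts) on the GPU;
    # returns one 8-dim vector per input text.
    return torch.randn(len(texts), 8)

# DataLoader buffers batches (num_workers > 0 in the real pipeline).
loader = DataLoader(
    rows,
    batch_size=4,
    num_workers=0,
    collate_fn=lambda batch: [r["text"] for r in batch],
)

all_embs = []
with torch.no_grad():
    for texts in loader:          # minibatch inference loop
        all_embs.append(embed(texts))

embs = torch.cat(all_embs)        # shape: (10, 8)
```

Note this version accumulates everything in `all_embs`, which is exactly the in-memory approach that breaks on large datasets.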
The best solution I can see at the moment is to rewrite ds.save_to_disk
so it runs the Torch inference loop before saving to disk. I can easily create a pyarrow.parquet.ParquetWriter
to save batches from inference to disk on each iteration, but the output won’t have the metadata files or convenient sharding that HF datasets expect. Another method is to store all the Torch tensors in memory and join them with the HF dataset afterwards, which fails as soon as the embeddings don’t fit in memory.
Is there a built-in feature or an easier method that I’m missing to accomplish this?