Is there any easy way to stream output to make a data pipeline with Datasets?
The use case is: I’d like to read an HF dataset, run every row through an embedding model, and save the embeddings to disk as a HF dataset.
I wrap the HF Dataset in a Torch DataLoader so I can buffer it with multiple workers and avoid starving the GPU of data. Then there’s a Torch inference loop that runs minibatches through the Sentence Transformers model on the GPU.
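To make the setup concrete, here’s a minimal sketch of that loop. The corpus and the `embed` function are placeholders standing in for the real HF dataset and `model.encode(...)` from Sentence Transformers, so this runs anywhere:

```python
import torch
from torch.utils.data import DataLoader

# Placeholder corpus; in practice this is the HF Dataset.
rows = [{"text": f"document {i}"} for i in range(10)]

def embed(texts):
    # Placeholder for SentenceTransformer.encode(texts) on the GPU;
    # returns one 8-dim vector per input text.
    return torch.randn(len(texts), 8)

# DataLoader buffers batches (num_workers > 0 in the real pipeline).
loader = DataLoader(
    rows,
    batch_size=4,
    num_workers=0,
    collate_fn=lambda batch: [r["text"] for r in batch],
)

all_embs = []
with torch.no_grad():
    for texts in loader:          # minibatch inference loop
        all_embs.append(embed(texts))

embs = torch.cat(all_embs)        # shape: (10, 8)
```

Note this version accumulates everything in `all_embs`, which is exactly the in-memory approach that breaks on large datasets.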
The best solution I can see at the moment is to rewrite ds.save_to_disk
so it runs the Torch inference loop before saving to disk. I can easily create a pyarrow.parquet.ParquetWriter
to save batches from inference to disk on each iteration, but the output won’t have the metadata files or convenient sharding that HF datasets expect. Another method is to store all the Torch tensors in memory and join them with the HF dataset afterwards, which fails as soon as the embeddings don’t fit in memory.
Is there a built-in feature or an easier method that I’m missing to accomplish this?