Using sparsely encoded tensors can significantly reduce the size of a dataset on disk, and it can also improve I/O. Currently, the Hugging Face datasets library does not seem to support sparse encoding of tensors, although Apache Arrow seems to support it (Tensors — Apache Arrow v14.0.1). Are there plans to integrate this in the future? Could someone point me to resources on how to implement this functionality in datasets myself? Cheers!
The Arrow Tensor API is not part of the Arrow format specification: surprising as it may be, you can’t store Arrow Tensors in Arrow files.
On the other hand, the Arrow format specification does define extension types for tensors, e.g. arrow.fixed_shape_tensor; see Canonical Extension Types — Apache Arrow v14.0.1. There is no sparse tensor extension type though (yet?).
In the meantime, you can store the data needed to reconstruct a sparse tensor in Arrow directly (i.e. the values, the indices, and the tensor format).
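As a rough illustration of that workaround, here is a minimal pure-Python sketch of a COO-style encoding. The helper names to_coo and from_coo and the column names "values", "indices", "shape" and "format" are hypothetical choices for this example, not part of datasets or Arrow; the idea is only that each row keeps plain lists, which Arrow can store as ordinary list columns.

```python
from itertools import product


def to_coo(dense, shape):
    """Encode a nested-list tensor as a COO row (hypothetical layout)."""
    indices, values = [], []
    # Visit every coordinate and keep only the non-zero entries.
    for idx in product(*(range(n) for n in shape)):
        item = dense
        for i in idx:
            item = item[i]
        if item != 0:
            indices.append(list(idx))
            values.append(item)
    return {
        "values": values,      # non-zero entries
        "indices": indices,    # one coordinate list per non-zero entry
        "shape": list(shape),  # needed to rebuild the dense tensor
        "format": "coo",       # label for the sparse layout used
    }


def zeros(shape):
    """Build a nested list of zeros with the given shape."""
    if len(shape) == 1:
        return [0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def from_coo(row):
    """Rebuild the dense nested list from a COO row."""
    dense = zeros(row["shape"])
    for idx, value in zip(row["indices"], row["values"]):
        target = dense
        for i in idx[:-1]:
            target = target[i]
        target[idx[-1]] = value
    return dense


dense = [[0, 0, 3], [0, 0, 0], [7, 0, 0]]
row = to_coo(dense, (3, 3))
restored = from_coo(row)
```

A list of such rows, regrouped into one list per column, could then be passed to something like Dataset.from_dict, with the decoding done on access. In practice you would likely build the indices and values with NumPy or SciPy rather than nested loops; the loops here only keep the sketch dependency-free.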