I can’t figure out how to go from huggingface-cli download (or git clone) to load_dataset() using that cached location. All three methods are documented here: Downloading datasets, yet they seem incompatible with each other. The only suggestion I found, and not a good one, was to never use the other two methods at all: python - How to load a huggingface dataset from local path? - Stack Overflow
So the idea is that I want to use hf_transfer, since it is roughly 10x faster at downloading than load_dataset. So I do:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2
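A variant I'd also be happy with, if it makes the loading side easier, is downloading into an explicit folder instead of the hub cache. I believe --local-dir is the flag for that, but correct me if I have it wrong:
huggingface-cli download --repo-type dataset vietgpt/the_pile_openwebtext2 --local-dir ./the_pile_openwebtext2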
Either way, the download is much faster than fetching the dataset through load_dataset, i.e. than doing:
num_proc_load_dataset=32
dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=num_proc_load_dataset, trust_remote_code=True)
But then I spent about 2 hours trying to figure out how to load the result with load_dataset, playing with cache_dir, data_dir, data_files, the dataset name, etc. I tried symlinking and copying all sorts of files everywhere, with no luck.
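Roughly the kinds of calls I was trying (the exact paths and combinations here are illustrative, not a transcript of everything I ran):
from datasets import load_dataset

snapshot = '/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834'
# point the cache at the hub download location
dataset = load_dataset('vietgpt/the_pile_openwebtext2', cache_dir='/home/fsuser/.cache/huggingface/hub')
# or point data_dir at the snapshot's data folder
dataset = load_dataset('vietgpt/the_pile_openwebtext2', data_dir=snapshot + '/data')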
I can see all the files in:
/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834/data
and they are the same files as in the original repo, exactly as if I had done a git clone into that same hash folder.
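For instance, I would have expected to be able to point the generic builder straight at those shards, something like this (assuming the files under data/ are parquet; the builder name would need to change if they are jsonl or something else):
from datasets import load_dataset

# load the cached shards directly with the generic parquet builder (assumes parquet shards)
dataset = load_dataset('parquet', data_files='/home/fsuser/.cache/huggingface/hub/datasets--vietgpt--the_pile_openwebtext2/snapshots/1de27c660aefd991700e5c13865a041591394834/data/*.parquet')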
But when I call load_dataset the normal way, it seems to use a totally different location, and on the separate computer where I did that, all the file names are different. Very confusing.
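To make the mismatch concrete, something like this shows where load_dataset actually reads from (cache_files is the standard Dataset attribute; I'm assuming the usual train split here):
from datasets import load_dataset

dataset = load_dataset('vietgpt/the_pile_openwebtext2', num_proc=32, trust_remote_code=True)
# prints the Arrow files the dataset is actually backed by; these live under
# ~/.cache/huggingface/datasets rather than the hub cache that huggingface-cli writes to
print(dataset['train'].cache_files)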
I was able to cobble together some hacked setup where it stopped complaining about the files, but then it asked for an example of a feature or something along those lines, which I didn't have and shouldn't need to provide.
Generally, suppose one uses huggingface-cli download or git clone: how does one go directly from that to load_dataset? It must be possible, since the downloaded files are exactly the ones in the origin repo.
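In other words, what I'm hoping exists is something as simple as one of these (purely hypothetical, this is exactly what I'm asking about; the clone path is made up):
from datasets import load_dataset

# option A: point load_dataset at a local git clone of the dataset repo
dataset = load_dataset('/path/to/git/clone/of/the_pile_openwebtext2')

# option B: call load_dataset with the repo id and have it reuse what
# huggingface-cli download already put in the hub cache, instead of re-downloading
dataset = load_dataset('vietgpt/the_pile_openwebtext2')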
Thanks!
Jon