Trying to convert LlaMa weights to HF and running out of RAM, but don't want to buy more RAM?

#4
by daryl149 - opened

Converting the original LLaMA 30B weights to HF format requires the conversion script to fit the entire model into RAM (PC RAM, not GPU VRAM; see https://huggingface.co/docs/transformers/main/model_doc/llama).
That's more than 65GB of RAM. Chances are that, as a hobbyist, you don't have that in your PC. If you:

  • run python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --output_dir /path/to/converted_llama_hf --model_size 30B
  • see your RAM go up to 100% in the activity monitor,
  • your PC freezes for a bit,
  • and then you see killed in the terminal,

that's your cue.
Instead of upgrading the RAM in your system for this one-time job, you can increase your PC's swap file so that whatever does not fit in RAM is offloaded to your SSD/HDD. This method does require at least 100GB of spare disk space. (That is more likely than having 128GB of RAM sitting in your PC.)

Here's how to increase swap for Ubuntu:

#check your current swap size
free -h
#turn off your current swap
sudo swapoff -a
#increase swap to 100GB to be able to offload the entire model from RAM to disk
sudo fallocate -l 100G /swapfile
#make sure swapfile permissions are set, then activate
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
#check new swap size (should say something like 97Gi)
free -h
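
As a rough cross-check of the free -h output, the following Python sketch sums RAM and swap from /proc/meminfo and compares the total with the 65GB figure mentioned above. The helper names (parse_meminfo, ram_plus_swap_gib, enough_for) are made up for this illustration, and treating 65 GiB as the threshold is an assumption; the real peak usage of the conversion script may be higher.

```python
from pathlib import Path

KIB_PER_GIB = 1024 ** 2  # /proc/meminfo reports memory fields in kiB


def parse_meminfo(text):
    """Return {field: number} for every 'Name:   12345 kB' style line."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def ram_plus_swap_gib(meminfo):
    """Total RAM plus total swap, in GiB."""
    total_kib = meminfo.get("MemTotal", 0) + meminfo.get("SwapTotal", 0)
    return total_kib / KIB_PER_GIB


def enough_for(meminfo, required_gib):
    """True if RAM + swap is at least required_gib (a lower bound only)."""
    return ram_plus_swap_gib(meminfo) >= required_gib


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")  # Linux only
    if meminfo_path.exists():
        info = parse_meminfo(meminfo_path.read_text())
        # 65 is an assumed threshold taken from the "more than 65GB" estimate.
        print(f"RAM + swap: {ram_plus_swap_gib(info):.1f} GiB")
        print("Enough for 65 GiB:", enough_for(info, 65))
```

If this reports less than the threshold after the steps above, the swap file was probably not activated.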

Congrats, you can now run python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --output_dir /path/to/converted_llama_hf --model_size 30B without buying more RAM.
It should be done in an hour or so. That is slower than usual because of the swapping between RAM and disk, but still very manageable for a one-time operation.
(If you'd like, you can shrink the swap file again afterwards to free up the disk space.)

@daryl149, a question about running this command:

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --output_dir /path/to/converted_llama_hf --model_size 30B

How do I know what input_dir and output_dir should be on my system? Or should I create them myself?

The input_dir is the folder called LLaMA where you saved the original LLaMA weights from Meta.
The output_dir can be anywhere you want to store the converted model.
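
To sanity-check an input_dir before starting the long conversion, a sketch like the one below can help. It assumes the usual layout of the original download: a tokenizer.model file at the top level and a 30B subfolder containing params.json plus consolidated.*.pth shards. That layout is an assumption, not something stated in this thread, and missing_inputs is a hypothetical helper name.

```python
from pathlib import Path


def missing_inputs(input_dir, model_size="30B"):
    """List expected files that are absent from input_dir (assumed layout)."""
    root = Path(input_dir)
    size_dir = root / model_size
    missing = []
    if not (root / "tokenizer.model").is_file():
        missing.append("tokenizer.model")
    if not (size_dir / "params.json").is_file():
        missing.append(f"{model_size}/params.json")
    # Only checks that at least one weight shard exists, not how many.
    if not list(size_dir.glob("consolidated.*.pth")):
        missing.append(f"{model_size}/consolidated.*.pth")
    return missing


if __name__ == "__main__":
    # Placeholder path from the command above; replace with your own folder.
    print(missing_inputs("/path/to/LLaMA"))
```

An empty list means the folder at least looks like what the command above expects; it says nothing about whether the files themselves are intact.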

@daryl149 Even if you managed to convert it, would you be able to run the model for inference with your RAM?

@Mlemoyne
Yes! For inference, PC RAM usage is not a bottleneck. You have these options:

  • if you have a combined GPU VRAM of at least 40GB, you can run it in 8-bit mode (35GB to host the model and 5 in reserve for inference). The PC RAM usage is negligible (<10GB).
  • if you have less than 40GB VRAM, then you can use the offload_folder='offload' parameter in your model call. It will offload every layer that does not fit in GPU VRAM to a folder you create called offload on your SSD/HDD. It will be slow, but it will run (2 minutes inference time or so)
  • if you have no notable GPU VRAM to speak of, you can convert the weights to ggml and run it on CPU. See this thread: https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor/discussions/2.
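
The numbers in these options follow from simple arithmetic: memory for the weights is roughly the parameter count times the bytes per parameter. The sketch below assumes about 32.5 billion parameters for the "30B" model (a figure not given in this thread) and reuses the 5GB reserve from the first option; weights_gb and fits_on_gpu are illustrative names.

```python
PARAMS_30B = 32.5e9  # assumption: approximate parameter count of the "30B" model


def weights_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def fits_on_gpu(vram_gb, n_params=PARAMS_30B, bits_per_param=8, reserve_gb=5):
    """True if the weights plus a reserve for inference fit in vram_gb."""
    return weights_gb(n_params, bits_per_param) + reserve_gb <= vram_gb


if __name__ == "__main__":
    print("16-bit weights (GB):", weights_gb(PARAMS_30B, 16))
    print("8-bit weights (GB):", weights_gb(PARAMS_30B, 8))
    print("Fits in 40GB VRAM at 8-bit:", fits_on_gpu(40))
    print("Fits in 24GB VRAM at 8-bit:", fits_on_gpu(24))
```

Under these assumptions the 16-bit weights come to about 65GB, which matches the RAM needed for conversion, and the 8-bit weights to a little over 32GB, which is roughly in line with the 35GB estimate above once some overhead is added.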

Can't we directly use the already converted model at https://huggingface.co/decapoda-research/llama-30b-hf and then run xor_codec.py on it?
Also, what are you all using for inference? I saw a similar inference-related thread: https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor/discussions/5. My process simply gets killed while loading the model, even after adding offload_folder='offload'.
It would be really helpful if anyone could advise.

Apparently not, see: https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor/discussions/15#644911e8e988635a3d6be312
I'm running a system with an RTX A6000, so that fits most model weights in 8-bit without offloading.
What are your VRAM, RAM and swap sizes? Are you running out of all 3? If so, try increasing the swap size in combination with the offload folder.

If using WSL on Windows, you can alternatively create a ".wslconfig" file in your C:\Users\%USERNAME%\ directory with the following (replace 55GB with something a little less than your total RAM):

[wsl2]
memory=55GB
swap=100GB

@daryl149 Can we fine-tune the model on our own dataset, the way it is done for other models hosted on Hugging Face? Please let me know what the actual process is.

I have a 24GB A10G GPU and 32GB of RAM. I have not set up any swap yet, so it is 0. I was trying everything with https://huggingface.co/decapoda-research/llama-30b-hf, which may be why things are not working. I will try again with the original LLaMA weights. I will have to use both of the methods you suggested:

  1. Converting the model to HF format (with extra swap, as I have too little RAM).
  2. Even with 8-bit, I will have to use offloading (a 48GB GPU would have been good, haha).

Thanks for the response!

@MatthewK I am on Linux only, so setting up the swap size directly should be straightforward. Thanks though!

I have not tried fine-tuning. It is also a question better asked in a new topic.

@daryl149 Sorry to bother you again, but all my checksums after the convert_to_hf are the same except for the following:

8bc8ad3b8256b780d0917e72d257c176 ./tokenizer.json

Was this the same for you? If not, any idea what might have happened? The rest of the files match perfectly.

After this, I used xor_codec.py, and now none of the checksums match :(
If possible, can you please share your tokenizer.json? I think that file is what messed the whole thing up.

I'd first try to call the model and see if it works for you anyway. Maybe you used a version of the file that was edited on Windows, which changes the line endings (for example LF to CRLF), see https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor/discussions/1#64444613c63001ae6355dda7. Or maybe you opened it manually once and your editor converted some line breaks automatically.

Probably best to open a new topic for your issue or redownload the file from where you got it. I can't share any files, as I don't want to incur the wrath of Meta.
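
To check whether line endings are behind a checksum mismatch, a small sketch like this prints a file's MD5 (the checksum quoted above is the length of an MD5 digest, which is assumed here) and counts the kinds of line breaks in it. The function names are illustrative only.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large files do not fill RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def line_ending_counts(path):
    """Count CRLF, bare LF and bare CR line breaks in a file."""
    data = Path(path).read_bytes()
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }


if __name__ == "__main__":
    target = Path("tokenizer.json")  # run from the converted model folder
    if target.exists():
        print(md5_hex(target), line_ending_counts(target))
```

Any non-zero "crlf" or "cr" count in a file that should only contain LF line breaks would point to an editor or tool having rewritten it.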

I completely understand. It has started working for me: I was in the wrong virtual env, and that somehow messed up one file. Either way, thanks a lot for your help and for this thread. I wouldn't have been able to run the model without it :)

What kind of help do you want? I am not a programmer; I work in networking and security.
