OOM During Finetuning on A100 GPU
Hi Noelia,
Thank you very much for sharing this tool! I'm looking forward to applying it to my own projects.
I'm trying to finetune your model on a set of ~2000 sequences. However, even on a 40 GB A100 GPU with a batch size of 1, I get an OOM error.
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file asyn_train.txt --validation_file asyn_val.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir finetune --learning_rate 1e-06 --per_device_train_batch_size 1 --low_cpu_mem_usage True
RuntimeError: CUDA error: an illegal memory access was encountered
Any help you can provide would be greatly appreciated. Thanks again!
Jonathan
Hi Noelia,
I was actually able to fix this issue. For those running finetuning on an HPC cluster, increasing --ntasks in the job submission resolved the OOM error.
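For anyone trying the same thing, here is a minimal sketch of what such a job script might look like. It assumes the cluster uses Slurm (where --ntasks is an sbatch option). The file name finetune.sbatch, the job name, and every #SBATCH value are illustrative placeholders, not the settings from the run above.

```shell
# Write a hypothetical Slurm job script to disk.
# All #SBATCH values below are assumptions; adjust them for your cluster.
cat > finetune.sbatch <<'EOF'
#!/bin/bash
# Hypothetical job name
#SBATCH --job-name=protgpt2-finetune
# Assumed value: the setting that was increased to resolve the error
#SBATCH --ntasks=4
# Assumed: request one GPU through generic resources
#SBATCH --gres=gpu:1
# Assumed wall-clock limit
#SBATCH --time=04:00:00

# Same finetuning command as in the original post
python run_clm.py \
  --model_name_or_path nferruz/ProtGPT2 \
  --train_file asyn_train.txt \
  --validation_file asyn_val.txt \
  --tokenizer_name nferruz/ProtGPT2 \
  --do_train --do_eval \
  --output_dir finetune \
  --learning_rate 1e-06 \
  --per_device_train_batch_size 1 \
  --low_cpu_mem_usage True
EOF

# Submit on the cluster with:
#   sbatch finetune.sbatch
```

The right --ntasks value depends on the cluster. On many Slurm setups, CPUs and host memory are allocated per task or per CPU, which may be why a larger value helped here, so it is worth checking your site's documentation.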
Thanks!
Jonathan
Fantastic, happy to hear!