Inference with RTX 3090 got OOM

#89
by kathylee - opened

Hi dear community, I am trying to run inference with Gemma-7b-it (downloaded locally) via transformers.pipeline on a local RTX 3090. To test it, I run inference in a for loop. The first 18 runs complete smoothly and respond quite fast, but then it runs out of memory (OOM):

"ERROR CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 172.06 MiB is free. Including non-PyTorch memory, this process has 23.09 GiB memory in use. Of the allocated memory 22.61 GiB is allocated by PyTorch, and 178.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"

My code is as follows (based on https://github.com/huggingface/blog/blob/main/gemma.md):
Before the for loop:

    model = pipeline("text-generation", model="./models/gemma-7b-it", model_kwargs={"torch_dtype": torch.bfloat16}, device="cuda")

In the for loop:

    messages = [{"role": "user", "content": text},]
    prompt = model.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    start_time = monotonic()
    outputs = model(prompt, max_new_tokens=max_tokens, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)

I tried setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as the error message suggests, but it did not help. The error is reproducible. Can someone please tell me: is this model able to run on an RTX 3090? Is there anything wrong in my code? Thanks a lot!
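
A rough back-of-the-envelope check may help with the "does it fit" question. The sketch below estimates the bf16 weight footprint and the key/value cache size for a given number of tokens. The Gemma-7B figures in it (about 8.54 billion parameters, 28 layers, 16 key/value heads, head dimension 256) are assumptions taken from the published model configuration, not from this thread, and the estimate ignores activations, logits, and CUDA/allocator overhead.

```python
GIB = 1024 ** 3


def weights_gib(n_params, bytes_per_param=2):
    """Memory for the weights alone (2 bytes per parameter in bfloat16)."""
    return n_params * bytes_per_param / GIB


def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim,
                 bytes_per_value=2, batch_size=1):
    """Memory for the key/value cache at a given sequence length.

    The leading factor 2 accounts for one key and one value tensor per layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens * batch_size / GIB


# Assumed Gemma-7B figures (not stated in this thread).
N_PARAMS = 8.54e9
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 16, 256

print(f"weights (bf16): {weights_gib(N_PARAMS):.1f} GiB")
for n_tokens in (512, 2048, 8192):
    kv = kv_cache_gib(n_tokens, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"KV cache at {n_tokens} tokens: {kv:.2f} GiB")
```

Under these assumptions the weights alone take roughly 16 GiB of the 23.68 GiB reported in the error message, so the model should load, but the headroom left for the cache and temporary buffers is limited. A single long input text or a large max_tokens value could plausibly be enough to exhaust it; the thread does not say how long the failing prompt was.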

I hit the same error with 30 GB of RAM and one T4. After several inference runs, the same error message appeared. I have found no solution so far besides adding more GPUs to my machine (it runs with 2 T4s, but I think that requirement is too much for a moderately sized model like a 4-bit quantized 7b model). I tried inference with bigger models, like mistral 7 x 8b, and did not have this problem.
@suryabhupa could you shed some light?

Hello! I'm a lot less familiar with running the models on PyTorch + GPUs unfortunately (internally, we use Jax + TPUs exclusively). I am indeed surprised that so much memory is being allocated on just forward passes; are the params being casted correctly and no other memory for backward passes are being allocated (e.g. all the activation buffers are deleted as they're not needed)?

Hi @suryabhupa, thank you for your reply! What exactly do you mean by "are the params being casted correctly and no other memory for backward passes are being allocated (e.g. all the activation buffers are deleted as they're not needed)?" I think the code above simply runs a forward pass for inference, with no backward pass and no fine-tuning. I'm not sure whether any additional buffer cleanup is needed; please point it out if it is. Thanks a lot!
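
One way to check whether memory is actually accumulating between runs is to log it after every call. The sketch below is a hypothetical helper, not something from the thread: `generate_with_memory_log` and `to_mib` are illustrative names, and `pipe` is assumed to be the text-generation pipeline created above on a CUDA device. It runs each call under `torch.inference_mode()` so that no autograd state is kept, drops the output object, and prints the allocated and peak memory per run.

```python
import gc


def to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / 1024 ** 2


def generate_with_memory_log(pipe, prompts, **generate_kwargs):
    """Run `pipe` on each prompt and print CUDA memory usage after every call.

    Assumes `pipe` is a transformers text-generation pipeline on a CUDA device.
    """
    import torch  # imported lazily so to_mib() works without PyTorch installed

    results = []
    for i, prompt in enumerate(prompts):
        torch.cuda.reset_peak_memory_stats()
        with torch.inference_mode():  # no autograd graph, so no backward-pass buffers
            outputs = pipe(prompt, **generate_kwargs)
        results.append(outputs[0]["generated_text"])  # keep only the generated text
        del outputs
        gc.collect()
        torch.cuda.empty_cache()  # hand cached but unused blocks back to the driver
        allocated = to_mib(torch.cuda.memory_allocated())
        peak = to_mib(torch.cuda.max_memory_allocated())
        print(f"run {i}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
    return results
```

It could be called as `generate_with_memory_log(model, prompts, max_new_tokens=max_tokens, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)`. If "allocated" stays flat from run to run while "peak" rises and falls with the prompt length, the failure is more likely caused by one long input or a large generation budget than by a leak. If "allocated" climbs steadily, something is holding references to tensors between iterations.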
