CUDA out of memory on A100 with 40GB
Hi, I tried to run the code example from this blog post: https://huggingface.co/blog/idefics2 on an A100 in Colab, but it failed at generated_ids = model.generate(**inputs, max_new_tokens=500). Is there any way to optimize the inference?
Hi @sagnak, you could deactivate image splitting to save some memory:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
Since you are using an A100, you should definitely load in bf16:
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
).to(DEVICE)
I assume you can also load the weights in 4-bit:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    quantization_config=quantization_config,
    device_map="auto",
)
See the blog post for more details: https://huggingface.co/blog/4bit-transformers-bitsandbytes.
Btw @SkalskiP, there's a chat version coming soon. This model has only undergone supervised fine-tuning (SFT), so it can't be compared directly to LLaVA, for instance (unless you want to compute metrics on multimodal benchmarks).
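As a rough sanity check on the two loading options above, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. This is only a back-of-the-envelope sketch: the 8 billion parameter count is an assumption read off the "8b" in the model name, the helper name is made up for illustration, and the estimate ignores activations, the generation cache, and any quantization overhead.

```python
# Approximate storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: roughly 8 billion parameters, inferred from the "8b" suffix.
NUM_PARAMS = 8e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>9}: ~{weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, float32 weights would take about twice as much memory as bf16, and 4-bit weights about a quarter of bf16, which is why the dtype and quantization settings matter so much on a 40GB card.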
I implemented the above optimizations, but I still get OOM on an A6000 GPU with 48 GB VRAM:
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from transformers import BitsAndBytesConfig
DEVICE = "cuda:0"
# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
Error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 874.00 MiB. GPU 0 has a total capacity of 47.30 GiB of which 861.44 MiB is free. Including non-PyTorch memory, this process has 46.44 GiB memory in use. Of the allocated memory 45.59 GiB is allocated by PyTorch, and 356.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
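To make the numbers in this error easier to compare, here is a small sketch that pulls them out of the message and converts everything to GiB. The regular expressions are written against the wording of this particular message and are an assumption; other PyTorch versions may phrase it differently. The function and variable names are made up for illustration.

```python
import re

# The error text from the report above, trimmed to the part with numbers.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 874.00 MiB. "
    "GPU 0 has a total capacity of 47.30 GiB of which 861.44 MiB is free. "
    "Including non-PyTorch memory, this process has 46.44 GiB memory in use. "
    "Of the allocated memory 45.59 GiB is allocated by PyTorch, "
    "and 356.39 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}

# Patterns assume the exact wording shown above.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (KiB|MiB|GiB)",
    "free": r"of which ([\d.]+) (KiB|MiB|GiB) is free",
    "allocated": r"([\d.]+) (KiB|MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (KiB|MiB|GiB) is reserved by PyTorch but unallocated",
}


def parse_oom(message: str) -> dict:
    """Return the sizes found in an OOM message, all converted to GiB."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return stats


stats = parse_oom(OOM_MESSAGE)
for name, gib in stats.items():
    print(f"{name:>22}: {gib:6.2f} GiB")
```

In this report the reserved-but-unallocated amount is smaller than the failed request, so fragmentation alone probably does not explain the failure; the 45.59 GiB already allocated by PyTorch is the more likely thing to investigate.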
Hi @starzmustdie, I just used your exact code snippet (modulo the pip installs) on a 16GB V100 and it's spiking at 9.5GB GPU memory... Can you say more about your setup?
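For anyone who wants to report a comparable peak figure from their own setup, one possible way is to read PyTorch's peak-memory counter around the generate call. This is a sketch, not necessarily how the 9.5GB number above was measured: the helper names are hypothetical, and the counter only covers memory handled by PyTorch's allocator, so tools like nvidia-smi may show a higher value.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def measure_peak_gib(fn, device: str = "cuda:0"):
    """Run fn() and return (result, peak GiB allocated by PyTorch on device).

    The peak is None when torch or CUDA is not available.
    """
    try:
        import torch
    except ImportError:
        return fn(), None
    if not torch.cuda.is_available():
        return fn(), None

    dev = torch.device(device)
    # Resets the peak to the currently allocated amount, so weights that are
    # already on the GPU still count towards the reported peak.
    torch.cuda.reset_peak_memory_stats(dev)
    result = fn()
    torch.cuda.synchronize(dev)
    return result, to_gib(torch.cuda.max_memory_allocated(dev))


# Example usage with the snippet above (model and inputs already created):
# generated_ids, peak = measure_peak_gib(
#     lambda: model.generate(**inputs, max_new_tokens=500)
# )
# print(f"peak GPU memory: {peak:.2f} GiB")
```

Sharing this number together with the transformers, bitsandbytes, and torch versions makes it easier to compare two setups that run the same code.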
@starzmustdie et al., I updated the section https://huggingface.co/HuggingFaceM4/idefics2-8b#model-optimizations to include some benchmarks on how to run idefics2 with very little GPU memory.
TLDR: there are plenty of low-lift setups that require less than 16GB GPU memory to run inference.