SageMaker deployment failing on ml.g5.2xlarge instance

#4
by rishisaraf11 - opened

I am getting the error below in CloudWatch. We are trying to deploy the model on an ml.g5.2xlarge instance. Is there a fix for this, or do we need to deploy on a bigger instance?

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction)
: 22.20 GiB
The above exception was the direct cause of the following exception:

NumbersStation org

The model can be deployed on g5.xlarge with torch.bfloat16.

Thanks @senwu. Can you please tell me how to set the torch.bfloat16 configuration in the deployment script? Sorry, I am new to this and don't know many of these configs. Below is the deployment script I am using:

import json  # needed for json.dumps below
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20230723T133694')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="0.9.3"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

predictor.predict({
    "inputs": "Can you please let us know more details about your ",
})
NumbersStation org

Hi @rishisaraf11

We haven't used SageMaker to deploy the model, and from the docs it doesn't seem to offer much flexibility. The model prefers torch.bfloat16, but you can still use other dtypes.

Hi @senwu

I tried different variations of passing SM_FRAMEWORK_PARAMS in env for the HuggingFaceModel class in the script shared by @rishisaraf11, but no luck:

hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
    'SM_FRAMEWORK_PARAMS': "{'torch_dtype': 'bfloat16'}"
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
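For anyone trying the same thing, here is a minimal sketch of one alternative, under assumptions this thread does not confirm. The image returned by get_huggingface_llm_image_uri runs text-generation-inference (TGI), whose launcher documents a `--dtype` option that can also be set through a `DTYPE` environment variable (values `float16` or `bfloat16`). Whether container version 0.9.3 honors it is an assumption, and `build_hub_env` is a hypothetical helper name used only for illustration.

```python
import json

# Values the TGI launcher documents for --dtype / DTYPE.
SUPPORTED_DTYPES = ("float16", "bfloat16")


def build_hub_env(model_id: str, num_gpus: int = 1, dtype: str = "bfloat16") -> dict:
    """Build the env dict passed to HuggingFaceModel(env=...).

    ASSUMPTION: the container reads the DTYPE variable (TGI launcher's
    --dtype option). This is not confirmed for container version 0.9.3.
    """
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"dtype must be one of {SUPPORTED_DTYPES}, got {dtype!r}")
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),  # env values must be strings
        "DTYPE": dtype,  # assumption: honored by the container in use
    }


hub = build_hub_env("NumbersStation/nsql-llama-2-7B")
print(hub)
```

The resulting `hub` dict would replace the one in the script above. If the container ignores `DTYPE`, the model will still load with its default dtype and the same out-of-memory error is to be expected.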

NumbersStation org

It seems SageMaker doesn't have full transformers support yet. You can use the default config for the model as well.

You can also use a g5.2xlarge machine, or pass low_cpu_mem_usage=True (see https://huggingface.co/docs/transformers/main_classes/model) to reduce RAM usage when loading the model.
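As a rough illustration of the low_cpu_mem_usage suggestion, the sketch below loads the model directly with transformers, outside the SageMaker container, with both torch.bfloat16 and low_cpu_mem_usage=True. The helper name `load_model` is hypothetical, and the sketch assumes torch, transformers, and accelerate are installed; it is not the deployment path used by the script in this thread.

```python
def load_model(model_id: str = "NumbersStation/nsql-llama-2-7B"):
    """Load tokenizer and model in bfloat16 with reduced CPU RAM usage.

    Imports live inside the function so that defining it does not
    require torch or transformers to be installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half the weight memory of float32
        low_cpu_mem_usage=True,      # avoids a second full copy in CPU RAM; needs accelerate
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_model()` downloads the weights, so it is best run on a machine with enough disk space and memory. Note that low_cpu_mem_usage affects host RAM during loading, not GPU VRAM.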

Thank you for the reply @senwu

The problem seems to be an overflow of GPU VRAM, which is limited to ~22.2 GB on ml.g5.2xlarge (NVIDIA A10G, 24 GB GPU).

Error: SageMaker deployment failed due to a memory error

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction)
: 22.20 GiB
NumbersStation org

The torch.float32 version of the model requires around 26 GB of VRAM. We will adjust the default dtype of the model this week.
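The figure above can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, assuming a nominal 7 billion parameters (the exact count for this model is not stated in the thread) and ignoring activations, KV cache, and CUDA overhead. `weight_memory_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3

# Device limit reported in the CloudWatch error above.
DEVICE_LIMIT_GIB = 22.20

# ASSUMPTION: nominal parameter count implied by the "7B" in the model name.
NUM_PARAMS = 7e9


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


for name, bytes_per_param in (("float32", 4), ("bfloat16", 2)):
    needed = weight_memory_gib(NUM_PARAMS, bytes_per_param)
    verdict = "fits" if needed < DEVICE_LIMIT_GIB else "does not fit"
    print(f"{name}: {needed:.1f} GiB of weights, {verdict} in {DEVICE_LIMIT_GIB} GiB")
```

Under these assumptions the float32 weights come to about 26 GiB, which exceeds the 22.20 GiB device limit, while bfloat16 weights come to about 13 GiB and leave headroom for inference.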
