SageMaker Endpoint error during inference
I tried model inference using an ml.g5.24xlarge SageMaker endpoint and am getting the following error:
An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "[Errno 28] No space left on device"
}
This looks like an issue with too little disk space on the instance. What size EBS volume did you attach?
As per the AWS Docs, the EBS volume size is 3800 GB.
Type | CPU | Memory | GPUs | GPU Memory | Storage (GB)
ml.g5.24xlarge | 96 | 384 GB | 4 | 96 GB | 1 x 3800
I also tried a larger instance type, ml.g5.48xlarge, which has double the specs.
Adding more details about the error from the SageMaker endpoint:
Caused by: java.io.IOException: No space left on device
pool-2-thread-6 ERROR An exception occurred processing Appender access_log org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to stream logs/access_log.log
2023-06-01 10:33:43,006 pool-2-thread-6 ERROR An exception occurred processing Appender access_log org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to stream logs/access_log.log
The volume you mentioned is typically mounted to /tmp/, while there is a separate volume mounted to /opt/ml/checkpoints, which you specify when launching the instance. I believe what is happening is that the model is downloaded under /opt/ml/checkpoints, which then gets exhausted. Assuming you use the HF estimator, could you try specifying volume_size = 200?
These instance types, ml.g5.24xlarge and ml.g5.48xlarge, do not support the volume_size parameter, as they come with a 3800 GB volume on the inference endpoint. If the instance type is the issue, can you suggest an appropriate one that can run the model without problems? I was able to run falcon-7B-instruct without any issues...
At this point I unfortunately do not understand SageMaker endpoints with Hugging Face models well enough to be able to assist you. The issue is definitely related to disk space though, as the error indicates: "[Errno 28] No space left on device". The 7B might work because it fits in the standard 30GB EBS volume. Over the coming weeks we hope to be able to provide easier ways to deploy the models.
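As a rough back-of-the-envelope check on why the 7B model might fit where the 40B one does not: the weight footprint is approximately the parameter count times the bytes per parameter. The sketch below assumes nominal parameter counts (7 and 40 billion) and bfloat16 weights at 2 bytes each. The helper name is hypothetical, and real checkpoints and caches add some overhead on top of this.

```python
def approx_weights_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

DEFAULT_VOLUME_GB = 30  # the standard volume size mentioned above

# Nominal parameter counts; an assumption for illustration only.
for name, billions in [("falcon-7b", 7), ("falcon-40b", 40)]:
    size = approx_weights_gb(billions)
    verdict = "fits" if size < DEFAULT_VOLUME_GB else "does not fit"
    print(f"{name}: ~{size:.0f} GB in bf16, {verdict} in {DEFAULT_VOLUME_GB} GB")
```

By this estimate the 40B weights alone need roughly 80 GB, so they have to land on the large volume rather than the default one.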
The transformers library downloads the model to the default cache location: ~/.cache/huggingface/hub. However, the EBS volume is mounted on /home/ec2-user/SageMaker. You can check by running df on a terminal.
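If a terminal is not handy (for example inside an inference script), the same check can be approximated from Python. This is a minimal sketch using the standard library's shutil.disk_usage; the helper name is hypothetical, and the list of paths is simply the set of locations mentioned in this thread, which may or may not exist in your environment.

```python
import shutil
from pathlib import Path
from typing import Optional

def free_space_gb(path: str) -> Optional[float]:
    """Free space in GB on the filesystem holding `path`, or None if the path does not exist."""
    p = Path(path).expanduser()
    if not p.exists():
        return None
    return shutil.disk_usage(p).free / 1e9

# Locations mentioned in this thread; which ones exist depends on the environment.
candidates = ["~/.cache/huggingface/hub", "/home/ec2-user/SageMaker", "/tmp", "/opt/ml/checkpoints"]
for path in candidates:
    free = free_space_gb(path)
    print(f"{path}: " + ("not present" if free is None else f"{free:.1f} GB free"))
```

Comparing the free space at the cache location with the size of the model should show quickly whether the download can fit there.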
You can change the transformers cache location to a directory by running this before importing the transformers library:
import os
os.environ['TRANSFORMERS_CACHE'] = '/home/ec2-user/SageMaker/transformers-cache/'
Here's the relevant documentation: https://huggingface.co/docs/transformers/v4.29.1/en/installation#cache-setup
I am no longer facing the "storage space" issue. It seems to have been resolved by the snippet below in inference.py:
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True, load_in_8bit=False, torch_dtype=torch.bfloat16, device_map="auto", cache_dir="/tmp/model_cache/")
The model got deployed on the SageMaker endpoint. However, when I look at the instance metrics, the GPU memory usage balloons to 788% and all 8 GPUs are utilized on ml.g5.48xlarge (8 NVIDIA A10G GPUs and 192 GiB GPU memory). We do not get an inference output, as the request often times out if there is no response within a minute. Is this machine enough to host the model for inference? Should we wait longer for a response?
Any other ml instance type I can try?
An ml.g5.12xlarge instance was enough for me to deploy Falcon-40B in the HF TGI DLC.
For running on SageMaker we would recommend having a look at this blog post: https://www.philschmid.de/sagemaker-falcon-llm
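For readers who want a feel for what a TGI DLC deployment looks like, here is a minimal sketch using the SageMaker Python SDK. It is not copied from the blog post: the helper names, the token limits, and the health check timeout are assumptions chosen for illustration, and role_arn is a placeholder for your own execution role. The GPU count of 4 matches the ml.g5.12xlarge mentioned above.

```python
def tgi_env(model_id: str, num_gpus: int,
            max_input_length: int = 1024, max_total_tokens: int = 2048) -> dict:
    """Environment variables for the TGI container; all values must be strings."""
    return {
        "HF_MODEL_ID": model_id,           # model to pull from the Hugging Face Hub
        "SM_NUM_GPUS": str(num_gpus),      # number of GPUs to shard the model across
        "MAX_INPUT_LENGTH": str(max_input_length),   # assumed limit, tune as needed
        "MAX_TOTAL_TOKENS": str(max_total_tokens),   # assumed limit, tune as needed
    }

def deploy_falcon(role_arn: str):
    """Hypothetical helper: deploy Falcon-40B-instruct behind the HF TGI container."""
    # Imported lazily so tgi_env can be used without the SageMaker SDK installed.
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    image_uri = get_huggingface_llm_image_uri("huggingface")
    model = HuggingFaceModel(
        role=role_arn,
        image_uri=image_uri,
        env=tgi_env("tiiuae/falcon-40b-instruct", num_gpus=4),
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",
        container_startup_health_check_timeout=600,  # assumed; large models load slowly
    )
```

Calling deploy_falcon with a valid role creates a billable endpoint, so treat this as a starting point and check the blog post for the currently recommended settings.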
@FalconLLM Yes, I looked at that and we have already deployed your model. Thanks for the help. I have posted this link for others in this community as well.