feat(reduce-max-num-batched-tokens): Reduce max-num-batched-tokens even though the error suggests reducing max_model_len
- download_model.py +4 -0
- run.sh +7 -0
download_model.py CHANGED
@@ -1,5 +1,6 @@
 import os
 from huggingface_hub import snapshot_download
+from transformers.utils.hub import move_cache
 
 hf_token: str = os.getenv("HF_TOKEN")
 if hf_token is None:
@@ -14,3 +15,6 @@ snapshot_download(
     revision="89a866a7041e6ec023dd462adeca8e28dd53c83e",
     token=hf_token,
 )
+
+# https://github.com/huggingface/transformers/issues/20428
+move_cache()
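For reference, here is a minimal sketch of how download_model.py reads after this change. The diff elides the middle of the file, so the repo_id (inferred from the --model flag in run.sh) and the body of the HF_TOKEN check are assumptions, not part of the commit:

import os
from huggingface_hub import snapshot_download
from transformers.utils.hub import move_cache

hf_token: str = os.getenv("HF_TOKEN")
if hf_token is None:
    # Assumed error handling; the diff only shows the `if` line.
    raise ValueError("HF_TOKEN environment variable is not set")

snapshot_download(
    repo_id="sail/Sailor-4B-Chat",  # assumed from run.sh; elided in this diff
    revision="89a866a7041e6ec023dd462adeca8e28dd53c83e",
    token=hf_token,
)

# Migrate any model files from the pre-v4.22 transformers cache layout so the
# snapshot downloaded above is found at serve time.
# https://github.com/huggingface/transformers/issues/20428
move_cache()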
run.sh CHANGED
@@ -15,11 +15,18 @@ printf "Running vLLM OpenAI compatible API Server at port %s\n" "7860"
 # --gpu-memory-utilization 0.85
 
 
+# Reducing max-num-batched-tokens to 7536 because we got this error:
+# INFO 11-27 15:32:01 model_runner.py:1077] Loading model weights took 7.4150 GB
+# INFO 11-27 15:32:09 worker.py:232] Memory profiling results: total_gpu_memory=14.58GiB initial_memory_usage=7.61GiB peak_torch_memory=9.31GiB memory_usage_post_profile=7.62GiB non_torch_memory=0.20GiB kv_cache_size=2.88GiB gpu_memory_utilization=0.85
+# INFO 11-27 15:32:10 gpu_executor.py:113] # GPU blocks: 471, # CPU blocks: 655
+# INFO 11-27 15:32:10 gpu_executor.py:117] Maximum concurrency for 32768 tokens per request: 0.23x
+# ERROR 11-27 15:32:10 engine.py:366] The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (7536). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
 python -u /app/openai_compatible_api_server.py \
     --model sail/Sailor-4B-Chat \
     --revision 89a866a7041e6ec023dd462adeca8e28dd53c83e \
     --host 0.0.0.0 \
     --port 7860 \
+    --max-num-batched-tokens 7536 \
     --dtype half \
     --enforce-eager \
     --gpu-memory-utilization 0.85
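As a sanity check on the numbers in that log: the 7536-token KV-cache capacity follows directly from the profiled block count, assuming vLLM's default KV-cache block size of 16 tokens (run.sh passes no --block-size flag, so the default is an assumption here):

# Reproducing the capacity figure from the vLLM log above.
gpu_blocks = 471    # "gpu_executor.py:113] # GPU blocks: 471"
block_size = 16     # vLLM's default tokens per KV-cache block (assumed)
kv_cache_tokens = gpu_blocks * block_size
print(kv_cache_tokens)  # 7536, the capacity named in the ERROR line

max_model_len = 32768   # the model's max seq len per the log
print(round(kv_cache_tokens / max_model_len, 2))  # 0.23, the reported "0.23x" concurrency

In other words, a single 32768-token request can never fit in the available cache, which is the rationale for capping --max-num-batched-tokens at the same 7536 figure.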