metadata
title: Deploy VLLM
emoji: 🐢
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
poetry export -f requirements.txt --output requirements.txt --without-hashes
- The `HUGGING_FACE_HUB_TOKEN` and `HF_TOKEN` environment variables must both exist at runtime (use the same value for both; the token must have read permission for the model).
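The token requirement above can be checked at startup. The following is a minimal sketch, not part of the original Space; the helper name `resolve_hf_token` is hypothetical, and it only assumes the two variable names given above.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the token if both variables are set to the same non-empty value.

    Hypothetical helper: fails fast instead of letting the model download
    fail later with a less obvious authentication error.
    """
    names = ("HUGGING_FACE_HUB_TOKEN", "HF_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")
    if env["HUGGING_FACE_HUB_TOKEN"] != env["HF_TOKEN"]:
        raise RuntimeError("HUGGING_FACE_HUB_TOKEN and HF_TOKEN must hold the same value")
    return env["HF_TOKEN"]
```

Calling `resolve_hf_token()` before launching the server surfaces a missing or mismatched secret immediately. It does not verify that the token has read permission for the model.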
VLLM OpenAI Compatible API Server
References: https://huggingface.co/spaces/sofianhw/ai/tree/c6527a750644a849b6705bb6fe2fcea4e54a8196
This `api_server.py` file is an exact copy of https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/entrypoints/openai/api_server.py
Changes (use a diff tool to see the exact changes to the file):
- Change every route in `api_server.py` that starts with “/v1/xxx” to “/api/v1/xxx”, then run `python api_server.py` with arguments. Source: https://discuss.huggingface.co/t/run-vllm-docker-on-space/70228/5?u=yusufs
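The route change described above can be sketched as a text rewrite. This is an illustration only, not how the file in this Space was edited; the helper `prefix_routes` and the sample decorator lines are hypothetical. It rewrites quoted strings that begin with `/v1` and leaves every other path untouched.

```python
import re

# Matches a quoted route string that is "/v1" or begins with "/v1/",
# for example "/v1/models". The same quote character must close the string.
_ROUTE = re.compile(r'''(["'])/v1(/[^"']*)?\1''')


def prefix_routes(source: str, prefix: str = "/api") -> str:
    """Return source with every quoted /v1 route moved under prefix."""

    def repl(match):
        quote, rest = match.group(1), match.group(2) or ""
        return f"{quote}{prefix}/v1{rest}{quote}"

    return _ROUTE.sub(repl, source)


# Illustrative input lines, not copied from api_server.py.
sample = '@router.get("/v1/models")\n@router.get("/health")\n'
print(prefix_routes(sample))
```

Running the rewrite twice is harmless, because an already prefixed path such as `"/api/v1/models"` no longer starts with `/v1` right after the quote. Review the result with a diff tool, since a regex cannot tell a route string from any other string that happens to start with `/v1`.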
Documentation about config
"serve,chat,complete",
"facebook/opt-12B",
'--config', 'config.yaml',
'-tp', '2'
The YAML file is equivalent to passing argument flags. Consider passing the flags defined here instead, since they are better documented: https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/entrypoints/openai/cli_args.py#L77-L237
Other arguments are the same as those of the `LLM` class, such as `--max-model-len`, `--dtype`, or `--otlp-traces-endpoint`.
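To make the equivalence between the YAML file and the flags concrete, here is a minimal sketch that turns an already parsed config mapping into a flag list. It is a simplified illustration, not vLLM's own loader. The helper `config_to_flags`, the example values, and the choice to emit a true boolean as a bare flag are all assumptions.

```python
def config_to_flags(config: dict) -> list[str]:
    """Convert a parsed config mapping into command-line flags.

    Assumption: keys are written exactly like the long flag names
    (for example "max-model-len"), and a true boolean becomes a bare flag.
    """
    flags = []
    for key, value in config.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            if value:
                flags.append(flag)  # a false boolean is simply omitted
        else:
            flags.extend([flag, str(value)])
    return flags


# Hypothetical values, as they might appear in config.yaml.
example = {"max-model-len": 4096, "dtype": "half"}
print(config_to_flags(example))
```

Writing the flags out explicitly keeps the launch command self-describing, which is the reason given above for preferring flags over the YAML file.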