vLLM on A100s
#19
by fsaudm
Has anyone successfully served this model with vLLM? I am trying to load and serve it on a cluster where I have access to 7 nodes with 2 A100s each. No OOM issues, but I don't have Docker. I was:
- Starting a Ray cluster with a head node and all other nodes as workers, all with the same conda env. Nothing fancy, just `ray start ...` (see the sketch after this list).
- Running `vllm serve ...` with:
  - `--tensor-parallel-size 2` (no. of GPUs per node)
  - `--pipeline-parallel-size 6` (no. of nodes)
...and then I get all sorts of errors. Any insights?
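For context, the Ray bootstrap I'm doing looks roughly like this (the env name, port, and head-node address are placeholders, not my actual values):

```bash
# On the head node
conda activate vllm-env          # same conda env on every node
ray start --head --port=6379

# On each worker node, pointing at the head node's address
conda activate vllm-env
ray start --address=<head-node-ip>:6379

# Sanity check from the head node: should list all nodes and GPUs
ray status
```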
Also, since A100s can't do fp8, I was setting `--dtype bfloat16`, `--quantization none`, etc. But I'm not sure if this is enough, and I haven't seen any bf16 version uploaded.
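For completeness, the launch I've been attempting looks roughly like this (the model id is a placeholder, and whether `--dtype bfloat16` alone is enough to deal with the fp8 checkpoint on A100s is exactly what I'm unsure about):

```bash
# Run on the head node once `ray status` shows all nodes.
# --tensor-parallel-size = GPUs per node, --pipeline-parallel-size = no. of nodes used.
vllm serve <model-id> \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 6 \
    --distributed-executor-backend ray \
    --dtype bfloat16
```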
You would need 40 A100s with 40 GB of GPU RAM each (20 if the GPUs have 80 GB each), i.e. about 1600 GB of GPU RAM in total, since bf16 is double the size of the fp8 release. You also need software that runs the model across GPUs in parallel.
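Rough arithmetic behind that estimate (the ~1600 GB figure is a ballpark for the bf16 weights plus overhead, not a measured number):

```bash
# Ballpark capacity check, assuming ~1600 GB of GPU RAM is needed for bf16
TOTAL_GB=1600
echo "A100 40GB cards needed: $((TOTAL_GB / 40))"   # 40
echo "A100 80GB cards needed: $((TOTAL_GB / 80))"   # 20
```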
I have never used vLLM, only llama.cpp with other models, and this V3 model is not supported by llama.cpp.