Too slow with CPU

#2
by cmaire - opened

Hi,
I tried your model but it is so slow. What can I do?
Thank you

Hi @cmaire ,

Since I don't have access to a GPU-enabled space, here's how to run the model locally:

Prerequisites:

  1. Request access to meta-llama/Llama-3.1-8B
  2. Generate a Hugging Face token
  3. Install Docker with NVIDIA GPU libraries on your CUDA-enabled machine
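Before pulling the Space image, it can help to confirm that Docker actually sees your GPU. A quick hedged check (the CUDA base image tag is just an example, any recent `nvidia/cuda` tag works):

```shell
# Verify the NVIDIA Container Toolkit is wired up:
# this should print your GPU in an nvidia-smi table.
docker run --rm --gpus=all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If this fails with a "could not select device driver" error, the NVIDIA Container Toolkit is not installed or Docker was not restarted after installing it.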

For Llama 3.1 (≈3 min/query on my poor RTX2860):

docker run --gpus=all -it -p 7860:7860 --platform=linux/amd64 \
  -e HF_TOKEN="hf_your_token" \
  registry.hf.space/eltorio-llama-3-1-8b-appreciation:latest python app.py

For faster responses, try Llama 3.2 3B (<10s/query):

  1. Request access to meta-llama/Llama-3.2-3B
  2. Run:
docker run --gpus=all -it -p 7860:7860 --platform=linux/amd64 \
  -e HF_TOKEN="hf_your_token" \
  registry.hf.space/eltorio-llama-3-2-3b-appreciation:latest python app.py

Access the interface at http://localhost:7860
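Once the container is up, you can sanity-check that the Gradio app is serving before opening a browser:

```shell
# Expect an HTTP 200 once the model has finished loading
# (the first startup can take a while while weights download).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860
```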

Best regards,
Ronan

Thank you for your answer, but my graphics card is AMD. Is that OK?

Short answer: no.

In fact it should be possible, but I have never tried it. See https://huggingface.co/blog/huggingface-and-optimum-amd

eltorio changed discussion status to closed
