Too slow with CPU

#2
by cmaire - opened

Hi,
I tried your model but it is so slow. What can I do?
Thank you

Hi @cmaire ,

Since I don't have access to a GPU-enabled space, here's how to run the model locally:

Prerequisites:

  1. Request access to meta-llama/Llama-3.1-8B
  2. Generate a Hugging Face token
  3. Install Docker with NVIDIA GPU libraries on your CUDA-enabled machine
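Before pulling the Space image, it can help to confirm that Docker actually sees your GPU. A quick hedged check (the CUDA base image tag is just an example, any recent `nvidia/cuda` tag works):

```shell
# Verify the NVIDIA Container Toolkit is wired up:
# this should print your GPU in an nvidia-smi table.
docker run --rm --gpus=all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If this fails with a "could not select device driver" error, the NVIDIA Container Toolkit is not installed or Docker was not restarted after installing it.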

For Llama 3.1 (≈3 min/query on my poor RTX2860):

docker run --gpus=all -it -p 7860:7860 --platform=linux/amd64 \
  -e HF_TOKEN="hf_your_token" \
  registry.hf.space/eltorio-llama-3-1-8b-appreciation:latest python app.py

For faster responses, try Llama 3.2 3B (<10s/query):

  1. Request access to meta-llama/Llama-3.2-3B
  2. Run:
docker run --gpus=all -it -p 7860:7860 --platform=linux/amd64 \
  -e HF_TOKEN="hf_your_token" \
  registry.hf.space/eltorio-llama-3-2-3b-appreciation:latest python app.py

Access the interface at http://localhost:7860
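Once the container is up, you can sanity-check that the Gradio app is serving before opening a browser:

```shell
# Expect an HTTP 200 once the model has finished loading
# (the first startup can take a while while weights download).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860
```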

Best regards,
Ronan

Thank you for your answer, but my graphics card is AMD. Is that OK?

Short answer: no.

In fact it should be possible, but I have never tried it. See https://huggingface.co/blog/huggingface-and-optimum-amd

eltorio changed discussion status to closed
