This repo contains 8-bit quantized GPTQ model files for [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

This model can be loaded with just over 10 GB of VRAM and can be served lightning fast on the cheapest Nvidia GPUs available (Nvidia T4, Nvidia K80, RTX 4070, etc.).

The 8-bit GPTQ quant has minimal quality degradation relative to the original `bfloat16` model thanks to its higher bit width.

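As a rough sketch (not part of this repo's official instructions), the quantized weights can also be loaded directly with 🤗 Transformers, assuming `optimum` and `auto-gptq` are installed and the files are downloaded to a local folder such as `Llama-3-8B-Instruct-GPTQ-8-Bit`:

```
# Illustrative loading sketch (assumed setup): pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local folder (or Hub repo id) holding this GPTQ model -- adjust to your setup.
model_path = "Llama-3-8B-Instruct-GPTQ-8-Bit"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# The GPTQ quantization config stored with the model is picked up automatically;
# device_map="auto" places the roughly 10 GB of weights on the available GPU.
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

messages = [{"role": "user", "content": "Explain GPTQ quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
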
<!-- description end -->
## GPTQ Quantization Method
| More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | Additional GPTQ 8-bit variants may be uploaded in the future using different parameters, such as a 128g group size. |
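
The exact quantization script is not included in this repo, but as a minimal sketch of how an 8-bit GPTQ quant like this can be produced with the Transformers GPTQ integration (the calibration dataset and group size below are illustrative assumptions, not necessarily the settings used here):

```
# Illustrative quantization sketch (assumed setup): pip install transformers optimum auto-gptq
# The calibration dataset ("c4") and group_size below are assumptions, not this repo's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

gptq_config = GPTQConfig(
    bits=8,          # 8-bit quantization, as in this repo
    dataset="c4",    # calibration data (assumption)
    group_size=128,  # e.g. the 128g variant mentioned above (assumption)
    tokenizer=tokenizer,
)

# Quantizes the bfloat16 weights layer by layer while calibrating on the dataset above.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
tokenizer.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
```
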
## Serving this GPTQ model using vLLM
Tested serving this model via vLLM using an Nvidia T4 (16 GB VRAM).

Tested with the command below:
```
python -m vllm.entrypoints.openai.api_server --model Llama-3-8B-Instruct-GPTQ-8-Bit --port 8123 --max-model-len 8192 --dtype float16
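
# Once the server is up, the OpenAI-compatible endpoint can be queried like this
# (illustrative request, not from this repo's docs):
curl http://localhost:8123/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-3-8B-Instruct-GPTQ-8-Bit", "prompt": "Hello, my name is", "max_tokens": 32}'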