---
license: llama3.1
train: false
inference: false
pipeline_tag: text-generation
---
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 4-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a> model.
We provide two versions:
* Calibration-free version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/
* Calibrated version: https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib/
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F636b945ef575d3705149e982%2Fi0vpy66jdz3IlGQcbKqHe.png%3C%2Fspan%3E)%3C%2Fspan%3E
![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq/resolve/main/llama3.1_4bit.gif)
## Model Size
| Models | fp16| HQQ 4-bit/gs-64 | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
| Bitrate (linear layers, bits/weight) | 16 | 4.5 | 4.25 | 4.25 |
| VRAM (GB) | 15.7 | 6.1 | 6.3 | 5.7 |
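The 4.5 bits/weight figure for HQQ can be sanity-checked by hand, assuming one fp16 scale and one fp16 zero-point per group of 64 weights (i.e. `quant_scale=False, quant_zero=False` as in the usage code below):
``` Python
#Effective bitrate of a 4-bit linear layer with group-size 64,
#assuming an fp16 scale and an fp16 zero-point per group (not further quantized)
weight_bits = 4
group_size  = 64
meta_bits   = 16 + 16 #one scale + one zero-point per group
print(weight_bits + meta_bits / group_size) #-> 4.5 bits/weight, matching the table
```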
## Model Decoding Speed
| Models | fp16| HQQ 4-bit/gs-64| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a>| <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|
| Decoding* - short seq (tokens/sec)| 53 | <b>125</b> | 67 | 3.7 |
| Decoding* - long seq (tokens/sec)| 50 | <b>97</b> | 65 | 21 |
*: decoding speed measured on an RTX 3090.
## Performance
| Models | fp16 | HQQ 4-bit/gs-64 (no calib) | HQQ 4-bit/gs-64 (calib) | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"> AWQ 4-bit </a> | <a href="https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"> GPTQ 4-bit </a> |
|:-------------------:|:--------:|:----------------:|:----------------:|:----------------:|:----------------:|
| ARC (25-shot) | 60.49 | 60.32 | 60.92 | 57.85 | 61.18 |
| HellaSwag (10-shot)| 80.16 | 79.21 | 79.52 | 79.28 | 77.82 |
| MMLU (5-shot) | 68.98 | 67.07 | 67.74 | 67.14 | 67.93 |
| TruthfulQA-MC2 | 54.03 | 53.89 | 54.11 | 51.87 | 53.58 |
| Winogrande (5-shot)| 77.98 | 76.24 | 76.48 | 76.4 | 76.64 |
| GSM8K (5-shot) | 75.44 | 71.27 | 75.36 | 73.47 | 72.25 |
| Average | 69.51 | 68.00 | <b>69.02</b> | 67.67 | 68.23 |
| Relative performance | 100% | 97.83% | <b>99.3%</b> | 97.35% | 98.16% |
The results above can be reproduced with the <a href="https://github.com/EleutherAI/lm-evaluation-harness">lm-evaluation-harness</a> (`pip install lm-eval==0.4.3`).
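The exact harness invocation is not included in this card, but a minimal sketch with the lm-eval Python API would look like the following, re-using the `model` and `tokenizer` objects loaded as in the Usage section below (GSM8K shown as an example; repeat with the tasks/few-shot settings from the table):
``` Python
import lm_eval
from lm_eval.models.huggingface import HFLM

#Wrap the already-loaded quantized model/tokenizer (see the Usage section below)
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)

#Run a single task as an example
results = lm_eval.simple_evaluate(model=lm, tasks=["gsm8k"], num_fewshot=5)
print(results["results"])
```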
## Usage
First, install the dependencies:
```bash
pip install git+https://github.com/mobiusml/hqq.git #install from the master branch (includes required fixes)
pip install bitblas #only needed if you use the bitblas backend
```
Also, make sure to use torch `2.4.0` or newer (or a nightly build) with CUDA 12.1 or later.
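You can quickly check the installed versions (illustrative check only):
``` Python
import torch
print(torch.__version__)  #expect 2.4.0+ or a nightly build
print(torch.version.cuda) #expect 12.1 or later
```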
Then you can use the sample code below:
``` Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator
#Load the model
###################################################
#model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq' #no calib version
model_id = 'mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib' #calibrated version
compute_dtype = torch.bfloat16 #bfloat16 for torchao_int4, float16 for bitblas
cache_dir = '.'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
#Re-attach the quantization config to the linear layers
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)
#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH) #set the default HQQLinear backend (pure PyTorch)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4")
#prepare_for_inference(model, backend="bitblas") #takes a while to init...
#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
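#e.g. (illustrative, untuned values): gen = HFGenerator(model, tokenizer, max_new_tokens=4000, cache_size=8192, do_sample=True, compile="partial").warmup()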
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
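You can also drive the quantized model with the standard `transformers` generation API instead of `HFGenerator`. A minimal sketch, re-using the `model` and `tokenizer` objects from the block above (prompt and sampling settings are illustrative):
``` Python
#Standard transformers generation with the quantized model (assumes the block above was run first)
messages = [{"role": "user", "content": "Write a haiku about quantization."}]
inputs   = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```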