**SqueezeLLM** is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.

**TLDR:** Deploying LLMs is difficult due to their large memory footprint. Reduced-precision quantization can address this,
but naive quantization hurts model performance. We address this with a new Dense-and-Sparse Quantization method.
Dense-and-Sparse splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance,
and a sparse component that preserves the sensitive and outlier values of the weight matrix. With this approach,
we are able to serve larger models with a smaller memory footprint, the same latency, and yet higher accuracy and quality.
For more details, please check out our [paper](https://arxiv.org/pdf/2306.07629.pdf).
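
The Dense-and-Sparse split described above can be illustrated with a small, self-contained sketch. This is not the SqueezeLLM implementation: it assumes outliers are chosen purely by magnitude and that the dense part uses plain uniform min-max quantization, whereas the paper describes its own sensitivity-aware scheme. The function names (`dense_and_sparse_quantize`, `reconstruct`, `mean_abs_error`) and the toy data are hypothetical.

```python
import random


def dense_and_sparse_quantize(weights, bits=4, sparsity=0.0045):
    """Split `weights` into low-bit dense codes plus full-precision outliers.

    Assumptions of this sketch: outliers are picked purely by magnitude and
    the dense part uses uniform min-max quantization.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    n = len(weights)
    # Number of values kept in full precision; always leave one inlier.
    n_sparse = min(n - 1, round(n * sparsity))
    by_magnitude = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)
    sparse = {i: weights[i] for i in by_magnitude[:n_sparse]}

    # The quantization range is computed from the inliers only.
    inliers = [w for i, w in enumerate(weights) if i not in sparse]
    lo, hi = min(inliers), max(inliers)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    # Outlier positions get a placeholder code; their value lives in `sparse`.
    codes = [0 if i in sparse else round((w - lo) / step)
             for i, w in enumerate(weights)]
    return codes, lo, step, sparse


def reconstruct(codes, lo, step, sparse):
    """Dequantize the dense codes, then restore the outliers exactly."""
    values = [lo + c * step for c in codes]
    for i, w in sparse.items():
        values[i] = w
    return values


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]
weights[10], weights[500] = 8.0, -9.0  # two planted outliers (toy data)

naive = reconstruct(*dense_and_sparse_quantize(weights, sparsity=0.0))
split = reconstruct(*dense_and_sparse_quantize(weights, sparsity=0.0045))
print("naive 4-bit error:      ", mean_abs_error(weights, naive))
print("dense-and-sparse error: ", mean_abs_error(weights, split))
```

In this toy setup, removing a handful of outliers shrinks the range the 4-bit grid has to cover, so the remaining values are represented more finely, while the outliers themselves are restored exactly.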

## Model description

This is a 4-bit quantized LLaMA 65B model produced with SqueezeLLM. More details can be found in the [paper](https://arxiv.org/pdf/2306.07629.pdf).

* **Base Model:** [LLaMA 65B](https://arxiv.org/abs/2302.13971)
* **Bitwidth:** 4-bit
* **Sparsity Level:** 0.45%
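
As a rough illustration of what these settings mean for weight storage, the arithmetic can be sketched as below. Everything beyond the bitwidth and sparsity level is an assumption: a nominal 65 billion parameters, all of them quantized, each sparse entry stored as a 16-bit value plus a 32-bit index, and no accounting for lookup tables or other metadata. The helper `estimate_weight_bytes` is hypothetical, and the actual checkpoint size may differ.

```python
def estimate_weight_bytes(n_params, bits, sparsity,
                          sparse_value_bits=16, sparse_index_bits=32):
    """Rough weight-storage estimate for a dense-and-sparse quantized model.

    Assumptions: every parameter is quantized to `bits` bits, and a fraction
    `sparsity` of them is additionally stored as (value, index) pairs.
    """
    dense_bytes = n_params * bits / 8
    n_sparse = n_params * sparsity
    sparse_bytes = n_sparse * (sparse_value_bits + sparse_index_bits) / 8
    return dense_bytes, sparse_bytes


N_PARAMS = 65_000_000_000  # nominal parameter count (assumption)

dense, sparse = estimate_weight_bytes(N_PARAMS, bits=4, sparsity=0.0045)
fp16 = N_PARAMS * 16 / 8  # unquantized 16-bit baseline

print(f"FP16 baseline: {fp16 / 1e9:.1f} GB")
print(f"4-bit dense:   {dense / 1e9:.1f} GB")
print(f"sparse part:   {sparse / 1e9:.2f} GB")
```

Under these assumptions the dense part takes 32.5 GB against 130 GB for the 16-bit baseline, and the sparse part adds less than 2 GB on top.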

## Links

* **Paper**: [https://arxiv.org/pdf/2306.07629.pdf](https://arxiv.org/pdf/2306.07629.pdf)
* **Code**: [https://github.com/SqueezeAILab/SqueezeLLM](https://github.com/SqueezeAILab/SqueezeLLM)

---
license: other
---