arxiv:2411.02355

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Published on Nov 4, 2024
· Submitted by ekurtic on Nov 5, 2024
#2 Paper of the day

Abstract

Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
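To make the integer formats named in the abstract concrete, here is a minimal sketch of symmetric round-to-nearest quantization at 8 and 4 bits. It is only an illustration of what "INT8" and "INT4" mean numerically. It is not the tuned quantization procedure used in the paper, and the helper names and sample weights are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor.

    Hypothetical helper for illustration; returns (integer codes, scale).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute round-trip error after quantize + dequantize."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up sample weights, not taken from any real model.
weights = [0.91, -0.42, 0.07, -0.88, 0.33, 0.005, -0.61, 0.74]

err_int8 = max_abs_error(weights, bits=8)
err_int4 = max_abs_error(weights, bits=4)
print(f"INT8 max round-trip error: {err_int8:.5f}")
print(f"INT4 max round-trip error: {err_int4:.5f}")
```

With a single per-tensor scale, the 4-bit grid is much coarser than the 8-bit one, which is why practical INT4 schemes typically rely on finer-grained scales and calibration rather than this naive rounding.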

Community

Paper author, paper submitter

vLLM + Quantization: We investigated the impact of quantization across all Llama sizes to come up with a set of practical guidelines for deployment across various use cases and GPU architectures in vLLM. The paper contains some interesting findings relative to what is usually considered "well-known".

For Table 5 (and similarly for Table 6), it doesn't make much sense to list the latency and cost for each of the different tasks, since the inference itself is always the same, just with different data. Instead, you could have one column with the average latency and cost, and add accuracy / performance scores for the different tasks, or something like that.
For example (this is what I would like to see, at least):
avg first-token latency (64 and 1024 input tokens), avg tokens/second throughput, tokens/$ (cost), accuracy task 1, accuracy task 2, ...

Paper author

Thanks for the comment. Our goal with that table was to illustrate the differences that arise across tasks whose numbers of prompt and decode tokens match those use cases. For example, summarization is very prefill-heavy, with thousands of prompt tokens but generally only a few hundred decode tokens. That contrasts with something like a single chat use case, which has only a few hundred tokens for both prefill and decode. Ultimately, these differences shift the performance profile from one dominated by compute to one dominated by memory movement, so it is important to fully characterize the performance differences across various general use cases.
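The compute-versus-memory point can be sketched with a back-of-the-envelope estimate. Assume a single linear layer, count only weight traffic (ignoring activations and the KV cache), and assume one new token per forward pass during decode for a single sequence. The token counts and the function name below are hypothetical, and this is a rough roofline-style sketch rather than the analysis used in the paper.

```python
def flops_per_weight_byte(tokens_per_pass, bytes_per_weight):
    """Rough arithmetic intensity of one linear layer.

    Multiplying n tokens by a d_in x d_out weight matrix costs about
    2 * n * d_in * d_out FLOPs and reads d_in * d_out * bytes_per_weight
    bytes of weights, so the matrix dimensions cancel out.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


# Storage cost per weight for each format (scales and zero-points ignored).
BYTES_PER_WEIGHT = {"BF16": 2.0, "W8A8": 1.0, "W4A16": 0.5}

# Hypothetical workload shape: a long prompt processed in one prefill pass,
# then decoding one token at a time for a single sequence.
PREFILL_TOKENS = 2048
DECODE_TOKENS = 1

for name, nbytes in BYTES_PER_WEIGHT.items():
    prefill = flops_per_weight_byte(PREFILL_TOKENS, nbytes)
    decode = flops_per_weight_byte(DECODE_TOKENS, nbytes)
    print(f"{name:6s} prefill: {prefill:8.0f} FLOPs/byte   decode: {decode:4.0f} FLOPs/byte")
```

Under these assumptions, prefill performs thousands of FLOPs per weight byte read and tends to be limited by compute, while single-sequence decode performs only a handful and tends to be limited by how fast weights can be read from memory. That is the regime where smaller weights help most. Note that W4A16 still computes in 16-bit, so it reduces weight traffic without reducing the arithmetic itself.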
