Intuition for quality decrease after quantization
My intuition has been that larger models lose relatively less quality after quantization than smaller models (e.g. Llama 2 70B in 4-bit would be closer to its original-precision counterpart than Llama 2 7B in 4-bit would be to its own).
Do you have any insight into whether that intuition holds for an MoE?
If, during inference, only 2 of the ~7B experts are active, I'd expect the quality loss after quantization to be relatively higher than for, say, a 45B non-MoE quantized model.
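Rough back-of-the-envelope sketch of that intuition (treating each expert as a full ~7B model and 2-of-8 routing are simplifying assumptions on my side, not exact Mixtral parameter counts):

```python
# Back-of-the-envelope sketch of the reasoning above.
# Assumptions (simplified, not exact Mixtral parameter counts):
# each expert is treated as a full ~7B model, and 2 of 8 experts are active per token.
total_experts = 8
active_experts = 2
params_per_expert_b = 7.0

total_params_b = total_experts * params_per_expert_b    # ~56B naive upper bound
active_params_b = active_experts * params_per_expert_b  # ~14B used per token

print(f"total ≈ {total_params_b:.0f}B, active per token ≈ {active_params_b:.0f}B")
# If quantization error tracks the *active* parameter count, the quality loss would
# look closer to that of a ~14B dense model than a ~45B dense one.
```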
Thank you in advance!
cc @marcsun13 who worked on the quantization!
Hi @krumeto, that's right. We've seen a quality decrease comparable to that of a quantized Llama 7B.
Thank you, @marcsun13! Since I asked the question, the first Open LLM Leaderboard results for the base GPTQ version have appeared. The decrease seems to be more or less in line with what we saw for the Llama 2 models:
Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
---|---|---|---|---|---|---|---|
mistralai/Mixtral-8x7B-v0.1 | 68.42 | 66.04 | 86.49 | 71.82 | 46.78 | 81.93 | 57.47 |
TheBloke/Mixtral-8x7B-v0.1-GPTQ | 65.7 | 65.19 | 84.72 | 69.43 | 45.42 | 81.14 | 48.29 |
Score Ratio (GPTQ / base) | 0.960 | 0.987 | 0.980 | 0.967 | 0.971 | 0.990 | 0.840 |
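As a sanity check, here is a tiny script that reproduces the ratio row from the scores above (the numbers are copied straight from the table, nothing is re-benchmarked):

```python
# Reproduce the "Score Ratio (GPTQ / base)" row: quantized score divided by base score.
base = {"Average": 68.42, "ARC": 66.04, "HellaSwag": 86.49, "MMLU": 71.82,
        "TruthfulQA": 46.78, "Winogrande": 81.93, "GSM8K": 57.47}
gptq = {"Average": 65.70, "ARC": 65.19, "HellaSwag": 84.72, "MMLU": 69.43,
        "TruthfulQA": 45.42, "Winogrande": 81.14, "GSM8K": 48.29}

for bench in base:
    print(f"{bench}: {gptq[bench] / base[bench]:.3f}")
# GSM8K shows the largest relative drop (~0.84); everything else stays within ~1-4%.
```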
This is great news for us (we're still waiting for the instruct model's GPTQ scores, but in general I hope this holds). We are testing the model with TGI (in 8-bit, EETQ) and waiting to test GPTQ (there still seem to be some TGI issues with GPTQ), but we're not quite sure which of the methods retains the most quality (we are less interested in speed). If you have any resources comparing Mixtral (or even other models) across EETQ/GPTQ/AWQ/bnb in terms of quality, that would be very helpful. This blog was already extremely insightful: https://huggingface.co/blog/overview-quantization-transformers#overview-of-natively-supported-quantization-schemes-in-%F0%9F%A4%97-transformers
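For reference, this is roughly how we load the bnb variant for our own comparison, a minimal sketch assuming the standard transformers + bitsandbytes 4-bit path from the linked blog post (the NF4/bfloat16 settings are our own choice, not a recommendation from that post):

```python
# Minimal sketch: loading Mixtral in 4-bit with bitsandbytes through transformers.
# The NF4 quant type and bfloat16 compute dtype are illustrative, not tuned values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread the experts across available GPUs
)
```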
Thank you all!
Hi @krumeto, thanks for the awesome feedback. We are still working on the AWQ quant since the quality is not good enough for now. For bnb, the quality should be the same. As for GPTQ, the model that was tested is not the best GPTQ quant. You can test the following branches, which should give better results: gptq-4bit-128g-actorder_True or gptq-4bit-32g-actorder_True, with 32g being the most accurate one. However, the VRAM consumption will increase since these quants need to store more quantization statistics (roughly an additional 1 GB for the 128g version and 3.5 GB for the 32g version).
More details: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ
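For anyone else trying those branches, here is a minimal loading sketch, assuming the transformers GPTQ integration (optimum and auto-gptq installed); the branch name goes into the revision argument:

```python
# Minimal sketch: loading one of the more accurate GPTQ branches via `revision`.
# Assumes the transformers GPTQ integration (optimum + auto-gptq installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
revision = "gptq-4bit-32g-actorder_True"  # most accurate branch, ~3.5 GB extra VRAM

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",
)
```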