Intuition for quality decrease after quantization

#23
by krumeto - opened

My intuition has been that larger models lose relatively less quality after quantization than smaller models (e.g. Llama 2 70B in 4-bit would be closer to its original-precision model than Llama 2 7B in 4-bit is to its original-precision model).

Do you have any insight into whether that intuition holds for an MoE?

If only 2 of the 7B experts are active during inference, then based on the above I'd expect the relative quality loss after quantization to be higher than for, say, a 45B non-MoE quantized model.

Thank you in advance!

cc @marcsun13 who worked on the quantization!

Hi @krumeto, this is right. We've seen a quality decrease comparable to that of a quantized Llama 7B.

Thank you, @marcsun13! Since I asked the question, the first Open LLM Leaderboard results for the base GPTQ version have appeared. The decrease seems to be more or less similar to what we saw with the Llama 2 models:

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-v0.1 | 68.42 | 66.04 | 86.49 | 71.82 | 46.78 | 81.93 | 57.47 |
| TheBloke/Mixtral-8x7B-v0.1-GPTQ | 65.7 | 65.19 | 84.72 | 69.43 | 45.42 | 81.14 | 48.29 |
| Score ratio (GPTQ / original) | 0.960 | 0.987 | 0.980 | 0.967 | 0.971 | 0.990 | 0.840 |
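For anyone who wants to recompute the last row, here is a minimal sketch. It assumes the row is the quantized score divided by the original score, which is what the numbers above are consistent with. The `score_ratios` helper is a hypothetical name used only for illustration.

```python
# Leaderboard scores copied from the table above.
BENCHMARKS = ["Average", "ARC", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8K"]
BASE = [68.42, 66.04, 86.49, 71.82, 46.78, 81.93, 57.47]   # mistralai/Mixtral-8x7B-v0.1
GPTQ = [65.7, 65.19, 84.72, 69.43, 45.42, 81.14, 48.29]    # TheBloke/Mixtral-8x7B-v0.1-GPTQ


def score_ratios(base, quant, names=BENCHMARKS):
    """Return quantized/original score per benchmark, rounded to 3 decimals."""
    return {name: round(q / b, 3) for name, b, q in zip(names, base, quant)}


ratios = score_ratios(BASE, GPTQ)
for name, ratio in ratios.items():
    print(f"{name:<11} {ratio:.3f}")
```

A ratio close to 1.0 means little was lost. GSM8K is the clear outlier here, while the other benchmarks keep roughly 96 to 99% of the original score.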

This is great news for us (we are still waiting for the GPTQ scores of the instruct model, but in general I hope this holds). We are testing the model with TGI (in 8-bit, EETQ) and waiting to test GPTQ (there still seem to be some TGI issues with GPTQ), but we are not sure which of the methods retains the most quality (we are less interested in speed). If you have any resources that compare EETQ/GPTQ/AWQ/bnb in terms of quality, for Mixtral or even for other models, they would be very helpful. This blog post was already extremely insightful: https://huggingface.co/blog/overview-quantization-transformers#overview-of-natively-supported-quantization-schemes-in-%F0%9F%A4%97-transformers
Thank you all!

Hi @krumeto, thanks for the awesome feedback. We are still working on the AWQ quant, since its quality is not good enough for now. For bnb, the quality should be the same. As for GPTQ, the model that was tested is not the best GPTQ quant. You can test one of the following branches, which should give better results: gptq-4bit-128g-actorder_True or gptq-4bit-32g-actorder_True, with 32g being the most accurate one. However, VRAM consumption will increase, since these quants need to store more quantization statistics (about 1 GB extra for the 128g version and about 3.5 GB extra for the 32g version).
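A rough back-of-the-envelope sketch of where that extra memory comes from. Everything here is an assumption made for illustration: roughly 46.7B quantized weights, one fp16 scale plus one 4-bit zero point stored per group, and the hypothetical helper name `group_overhead_gb`. Real checkpoints also keep some layers unquantized and store extra indices for act-order, so treat the result as an estimate only.

```python
def group_overhead_gb(n_params, group_size, scale_bits=16, zero_bits=4):
    """Estimate the memory (in GB) spent on per-group quantization statistics.

    Assumes each group of `group_size` weights stores one scale and one
    zero point; the bit widths are illustrative defaults, not a spec.
    """
    n_groups = n_params / group_size
    total_bits = n_groups * (scale_bits + zero_bits)
    return total_bits / 8 / 1e9


N_PARAMS = 46.7e9  # assumed approximate total parameter count for Mixtral 8x7B

for g in (128, 32):
    print(f"group size {g:>3}: ~{group_overhead_gb(N_PARAMS, g):.2f} GB of statistics")
```

Under these assumptions the estimate lands near 0.9 GB for 128g and 3.6 GB for 32g, which is in the same range as the figures quoted above. Smaller groups mean more scales and zero points, so better accuracy is paid for in memory.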

More details: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ
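For reference, a minimal sketch of how one of the branches mentioned above could be loaded with transformers by passing the branch name as `revision`. It assumes a GPTQ backend (for example `optimum` with `auto-gptq`) is installed and that enough GPU memory is available. The `BRANCHES` mapping and the `load_gptq_branch` helper are hypothetical names for illustration, not part of any library.

```python
MODEL_ID = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"

# Branch names as given in this thread; the dict keys are just local labels.
BRANCHES = {
    "128g": "gptq-4bit-128g-actorder_True",
    "32g": "gptq-4bit-32g-actorder_True",
}


def load_gptq_branch(group_size="32g"):
    """Load tokenizer and model from the chosen GPTQ branch of the repo."""
    # Imported lazily so the constants above can be used without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    revision = BRANCHES[group_size]
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        revision=revision,   # selects the git branch on the Hub
        device_map="auto",   # spread layers across available GPUs
    )
    return tokenizer, model


# Example (downloads tens of GB, so it is left commented out):
# tokenizer, model = load_gptq_branch("32g")
```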
