Generation configs: Unquantised vs AWQ, model weights format

#1 by nfunctor - opened

Thanks for making these quants rapidly available!

It appears that your generation configs differ from those in DeepSeek's original repos, likely because DeepSeek only added them 11 hours ago. Do you think you might update your quants to include the correct generation configs? My understanding is that the sampling parameters do not change when converting to AWQ, but I may be wrong. Thank you!
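(For anyone who wants to diff the configs themselves, a minimal sketch using transformers' `GenerationConfig`, assuming both repos ship a generation_config.json; the AWQ repo id is a placeholder, and the keys listed are just the common sampling parameters, not an exhaustive set.)

```python
# Compare a few sampling parameters between the original repo and a quant.
# "your-org/DeepSeek-R1-Distill-Qwen-7B-AWQ" is a placeholder repo id.
from transformers import GenerationConfig

original = GenerationConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
quant = GenerationConfig.from_pretrained("your-org/DeepSeek-R1-Distill-Qwen-7B-AWQ")

for key in ("do_sample", "temperature", "top_p"):
    print(key, getattr(original, key, None), getattr(quant, key, None))
```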

nfunctor changed discussion title from Generation configs: FP16 vs AWQ to Generation configs: Unquantised vs AWQ, model weights format

Also, as an update, I've noticed that all the AWQ DeepSeek models seem to have bfloat16 in their config. AWQ weights are normally stored in float16, and inference engines like vLLM get confused unless I pass float16 instead of auto for dtype. Is this purely a config mistake, or is some downcasting happening that may be hurting performance?
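(For reference, the workaround looks roughly like this in vLLM; the repo id is a placeholder, and the sampling values are the ones from DeepSeek's updated generation configs, if I'm reading them right.)

```python
# Force float16 so vLLM doesn't trip over the bfloat16 entry in the
# quantized checkpoint's config. The repo id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/DeepSeek-R1-Distill-Llama-8B-AWQ",
          quantization="awq",
          dtype="float16")  # instead of the default "auto"

params = SamplingParams(temperature=0.6, top_p=0.95)
out = llm.generate(["Hello, world"], params)
print(out[0].outputs[0].text)
```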

UPD: I actually managed to load it in bfloat16 in vLLM somehow, and I saw that you now allow dtype selection in the latest version of AWQ. This is new and a bit confusing to me, so I would appreciate a comment about this if possible!
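(To be concrete: through plain transformers, an explicit dtype override when loading an AWQ checkpoint would look like the sketch below. I'm not certain this matches the new AWQ-side option exactly, and the repo id is a placeholder.)

```python
# Explicitly pick the compute dtype instead of trusting the config entry.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/DeepSeek-R1-Distill-Qwen-7B-AWQ",  # placeholder repo id
    torch_dtype=torch.float16,  # or torch.bfloat16; overrides config.torch_dtype
    device_map="auto",
)
```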

The DeepSeek V3 model needs to load in bfloat16 instead of float16 to avoid erroring out during inference. Therefore, I updated the model to load in its native data type.
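(A quick way to see what dtype a checkpoint declares as native, without downloading any weights:)

```python
# Inspect the declared dtype from config.json; only the config is fetched.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
print(cfg.torch_dtype)  # expected torch.bfloat16, per the discussion above
```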

OK, I'm not sure I fully understand (I meant the distilled Qwen- and Llama-based models), but I take it that bfloat16 is native for all the R1-distilled quants. Thanks.
