Generation configs: Unquantised vs AWQ, model weights format
Thanks for making these quants available so quickly!
It appears that your generation configs differ from those in DeepSeek's original repos, likely because DeepSeek added them only 11 hours ago. Do you think you might update your quants with the correct generation configs? My understanding is that the sampling parameters do not change when quantizing to AWQ, but I may be wrong. Thank you!
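In case it helps, here's a quick way to diff the sampling defaults between the two repos with `transformers` (the repo ids are placeholders; substitute the original and quantized repos you are comparing):

```python
from transformers import GenerationConfig

# Repo ids are placeholders for illustration.
orig = GenerationConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
quant = GenerationConfig.from_pretrained("your-org/DeepSeek-R1-Distill-Qwen-32B-AWQ")

# Compare the sampling parameters that matter most for R1-style models.
for key in ("do_sample", "temperature", "top_p"):
    print(key, getattr(orig, key, None), getattr(quant, key, None))
```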
Also, as an update, I've noticed that all the AWQ DeepSeek models seem to have `bfloat16` in their config. AWQ weights are normally stored in `float16`, and inference engines like vLLM get confused unless I pass `float16` instead of `auto` for `dtype`. Is this purely a config mistake, or is some downcasting happening that may be hurting performance?
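For reference, this is how I'm working around it by forcing the dtype explicitly (the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

# Force float16 instead of letting vLLM follow the bfloat16 declared
# in the quant's config.json; AWQ kernels have traditionally run in float16.
llm = LLM(
    model="your-org/DeepSeek-R1-Distill-Qwen-32B-AWQ",
    quantization="awq",
    dtype="float16",
)
outputs = llm.generate(
    ["Hello"],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```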
UPD: I actually managed to load it in `bfloat16` in vLLM somehow, and I saw that you now allow `dtype` selection in the latest version of AWQ. This is new and a bit confusing to me, so I would appreciate a comment on this if possible!
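For anyone following along, this is how I've been checking what dtype a quant repo actually declares (repo id is a placeholder again):

```python
from transformers import AutoConfig

# Inspect the dtype and quantization settings declared in config.json.
cfg = AutoConfig.from_pretrained("your-org/DeepSeek-R1-Distill-Qwen-32B-AWQ")
print(cfg.torch_dtype)                             # e.g. torch.bfloat16
print(getattr(cfg, "quantization_config", None))   # AWQ settings, if present
```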
The DeepSeek V3 model needs to load in `bfloat16` instead of `float16` to avoid erroring out during inference, so I updated the model to load in its native data type.
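E.g., with vLLM it would look like this (the model id and `tensor_parallel_size` are placeholders you'd adjust for your setup):

```python
from vllm import LLM

# Minimal sketch: load the AWQ quant with its declared native dtype.
llm = LLM(
    model="your-org/DeepSeek-V3-AWQ",
    quantization="awq",
    dtype="bfloat16",
    tensor_parallel_size=8,
)
```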
OK, I'm not sure I fully understand (I meant the distilled Qwen- and Llama-based models), but I take it that `bfloat16` is native for all the R1-distilled quants. Thanks.