Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?
I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:

| File Name | Size |
|---|---|
| model.onnx | 654 MB |
| model_fp16.onnx | 327 MB |
| model_q4.onnx | 200 MB |
| model_q4f16.onnx | 134 MB |

I understand that:

- `model.onnx` is the fp32 model,
- `model_fp16.onnx` is the model whose weights are quantized to fp16.

I don't understand the sizes of `model_q4.onnx` and `model_q4f16.onnx`:

- Why is `model_q4.onnx` 200 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4.onnx` meant that the weights are quantized to 4 bits.
- Why is `model_q4f16.onnx` 134 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4f16.onnx` meant that the weights are quantized to 4 bits and activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

  > `qAfB(_id)`, where `A` represents the number of bits for storing weights and `B` represents the number of bits for storing activations.

  and the question "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably).
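
To make the comparison concrete, here is a small sketch of the arithmetic behind my question. The sizes are copied from the file listing above; the variable names are only for illustration, and the "expected" value reflects my own assumption of a 4x reduction, which may well be the wrong expectation.

```python
# File sizes in MB, copied from the Hugging Face file listing above.
sizes_mb = {
    "model.onnx": 654,
    "model_fp16.onnx": 327,
    "model_q4.onnx": 200,
    "model_q4f16.onnx": 134,
}

baseline = sizes_mb["model.onnx"]

# My assumption: a 4-bit model should be 4x smaller than model.onnx.
expected_q4 = baseline / 4

# Actual reduction factor of each file relative to model.onnx.
ratios = {name: baseline / size for name, size in sizes_mb.items()}

for name, size in sizes_mb.items():
    print(f"{name}: {size} MB, {ratios[name]:.2f}x smaller than model.onnx")

print(f"Size I expected for the q4 variants: {expected_q4} MB")
```

Neither q4 file lands on the 163.5 MB I expected: `model_q4.onnx` is larger than that and `model_q4f16.onnx` is smaller, which is what I am trying to understand.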