Works great, much faster inference. Quantization possible?
Thank you for this! As per the title, I'm getting good results (same rank output as the original model) while running much faster. Memory usage is about the same as the original model. Is it possible to quantize these models to reduce size and memory footprint and speed up inference further?
Actually, I just tried it myself and it turns out to be fairly easy. Quantization seems to work only up to optimization level O3; I'm not sure whether quantizing an O4 model is possible or just needs more tweaking. Not bad: file size dropped further to ~1/4 of the original model, memory footprint is lower than ONNX-O4, inference is even faster, and so far the output is the same!
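In case it helps anyone, here's roughly what I did with Optimum's ORTQuantizer. This is just a sketch: the paths are placeholders, I'm assuming a feature-extraction model (swap in the ORTModelFor... class for your task), and I used dynamic int8 quantization with the avx512_vnni config for my CPU:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the already-exported ONNX model (placeholder path;
# use the ORTModelFor... class matching your task).
model = ORTModelForFeatureExtraction.from_pretrained("./onnx-o3")

# Dynamic int8 quantization; pick the config matching your CPU
# (avx2 and arm64 variants also exist on AutoQuantizationConfig).
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir="./onnx-o3-quantized", quantization_config=qconfig)
```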
According to the docs, O4 and O3 are the same level of optimization; the only difference is that O4 is GPU-only. I'll check later whether O3's performance really is better.
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization
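For reference, a minimal sketch of applying the O3 level with Optimum's ORTOptimizer (the checkpoint name is a placeholder, and I'm again assuming a feature-extraction model):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Placeholder checkpoint; export=True converts a regular
# Transformers checkpoint to ONNX on the fly.
model = ORTModelForFeatureExtraction.from_pretrained("your-model", export=True)

# Per the docs, O3 applies the same graph fusions as O4 but without
# the GPU-only fp16 conversion, so it should be safe on CPU.
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir="./onnx-o3", optimization_config=AutoOptimizationConfig.O3())
```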
Also, quantization can further improve performance.