Works great, much faster inference. Quantization possible?
Thank you for this! As per the title, I'm getting good results (same rank output as the original model) while running much faster. Memory usage is about the same as the original model. Is it possible to quantize these models to reduce size and memory footprint and speed up inference further?
Actually, I just tried it myself and it turns out to be fairly easy. Quantization seems to work only up to optimization level O3; I'm not sure whether quantizing an O4 model is possible or just needs more tweaking. Not bad: file size dropped further to ~1/4 of the original model, memory footprint is lower than ONNX-O4, inference is even faster, and so far the output is the same!
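In case it helps anyone, here's roughly what I did with Optimum's ORTQuantizer. This is just a sketch: the paths are placeholders, I'm assuming a feature-extraction model (swap in the ORTModelFor... class for your task), and I used dynamic int8 quantization with the avx512_vnni config for my CPU:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the already-exported ONNX model (placeholder path;
# use the ORTModelFor... class matching your task).
model = ORTModelForFeatureExtraction.from_pretrained("./onnx-o3")

# Dynamic int8 quantization; pick the config matching your CPU
# (avx2 and arm64 variants also exist on AutoQuantizationConfig).
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir="./onnx-o3-quantized", quantization_config=qconfig)
```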
According to the docs, O4 and O3 are the same level of optimization; the only difference is that O4 is GPU-only. I'll check later whether O3's performance really is better.
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization
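For reference, a minimal sketch of applying the O3 level with Optimum's ORTOptimizer (the checkpoint name is a placeholder, and I'm again assuming a feature-extraction model):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Placeholder checkpoint; export=True converts a regular
# Transformers checkpoint to ONNX on the fly.
model = ORTModelForFeatureExtraction.from_pretrained("your-model", export=True)

# Per the docs, O3 applies the same graph fusions as O4 but without
# the GPU-only fp16 conversion, so it should be safe on CPU.
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir="./onnx-o3", optimization_config=AutoOptimizationConfig.O3())
```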
Also, quantization can further improve performance.