Hello,
I would like to improve the inference time of a finetuned T5-base (translation task). I am currently loading the .bin file with from_pretrained and running on a GPU. I have tried several approaches, such as ONNX and TensorRT, but with max_length=1024 they perform worse (sometimes by a lot). Are there any techniques that could help? Thank you.
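For context, my current setup looks roughly like this (the checkpoint path is a placeholder):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder path to the finetuned checkpoint directory (config.json + pytorch_model.bin).
tokenizer = T5Tokenizer.from_pretrained("path/to/finetuned-t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5-base").to(device)
model.eval()

inputs = tokenizer("translate English to French: Hello world", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```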
Hey,
Did you try quantization?
There is an example for the Pegasus model here. I tried it and it performed pretty well for summarization, with a 2x to 3x decrease in inference time.
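The idea is PyTorch dynamic quantization. A minimal sketch applied to T5 would look something like this (the checkpoint path is a placeholder, and I'm skipping any save/reload handling from that example):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the finetuned model and tokenizer (placeholder path).
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5-base")
model.eval()
tokenizer = T5Tokenizer.from_pretrained("path/to/finetuned-t5-base")

# Convert nn.Linear weights to int8; activations are quantized on the fly at runtime.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("translate English to French: Hello world", return_tensors="pt")
with torch.no_grad():
    output_ids = quantized_model.generate(**inputs, max_length=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```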
Thanks @YannAgora. Can it achieve that 2x or 3x speedup on GPU?
I haven’t tried it on a GPU instance, but I don’t see why it wouldn’t work.
@YannAgora I got this error when using the code and adding reconstructed_quantized_model.to("cuda") for GPU inference:
NotImplementedError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear_dynamic' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
I saw here that dynamic quantization is not supported on GPU.
Oh ok, I didn’t know that.