Hello,
I would like to improve the inference time of a finetuned T5-base (translation task). I am currently loading the .bin file with from_pretrained and running on a GPU. I have tried several approaches, such as ONNX and TensorRT, but with max_length=1024 they perform worse (sometimes by a lot). Are there any techniques that could help? Thank you.
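For context, my current setup looks roughly like this (the checkpoint path is a placeholder):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder path to the finetuned checkpoint directory (config.json + pytorch_model.bin).
tokenizer = T5Tokenizer.from_pretrained("path/to/finetuned-t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5-base").to(device)
model.eval()

inputs = tokenizer("translate English to French: Hello world", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```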
Hey,
Did you try quantization?
There is an example for the Pegasus model here. I tried it and it performed pretty well for summarization, with a 2x to 3x decrease in inference time.
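The idea is PyTorch dynamic quantization. A minimal sketch applied to T5 would look something like this (the checkpoint path is a placeholder, and I'm skipping any save/reload handling from that example):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the finetuned model and tokenizer (placeholder path).
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5-base")
model.eval()
tokenizer = T5Tokenizer.from_pretrained("path/to/finetuned-t5-base")

# Convert nn.Linear weights to int8; activations are quantized on the fly at runtime.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("translate English to French: Hello world", return_tensors="pt")
with torch.no_grad():
    output_ids = quantized_model.generate(**inputs, max_length=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```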
Thanks @YannAgora. Can it achieve that 2x or 3x speedup on GPU?
I haven’t tried it on a GPU instance, but I don’t see why it wouldn’t work.
@YannAgora I got this error when using the code and adding reconstructed_quantized_model.to("cuda") for GPU inference:
NotImplementedError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear_dynamic' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
I saw here that dynamic quantization is not supported on GPU.
Oh ok, I didn’t know that.