Hello everyone, I have been trying to speed up the GPT-Neo 1.3B model using ONNX, and have been running into significant issues.
I first exported the GPT-Neo 1.3B model using the causal-lm feature. The export produced a folder containing model.onnx along with a number of other files.
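For reference, the export command was roughly along these lines (reconstructed from memory, so the exact invocation may differ slightly; the onnx/ output directory matches the path used below):

python -m transformers.onnx --model=EleutherAI/gpt-neo-1.3B --feature=causal-lm onnx/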
I then tried running the ONNX model with ONNX Runtime, as shown on this page.
Here is the code I used.
from transformers import GPT2Tokenizer
import onnxruntime as rt

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Prefer TensorRT, then CUDA, and fall back to CPU
ONNX_PROVIDERS = ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
session = rt.InferenceSession("onnx/model.onnx", providers=ONNX_PROVIDERS)

inputs = tokenizer("Using gpt-neo with ONNX Runtime and ", return_tensors="np")
outputs = session.run(output_names=["logits"], input_feed=dict(inputs))
I used the %%time magic in the Jupyter cell, and the code above took more than 5 minutes to execute.
After that I used a longer sentence and tried running inference again, but the cell never finished executing (I waited for about an hour).
%%time
inputs = tokenizer("Using gpt-neo with ONNX Runtime again and this time with many more words which will put considerable load on the GPU as well as the CPU ", return_tensors="np")
outputs = session.run(output_names=["logits"], input_feed=dict(inputs))
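One thing I have not verified yet is which execution provider the session actually ended up using; as far as I know, something like the following should list the providers registered for the session, but I have not checked its output:

print(session.get_providers())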
I seem to be missing something, as I am certain this shouldn’t take so long. Could anyone please help me?