Hi, I want to convert the GPT-J model to ONNX to improve the inference speed. I tried to convert the model to ONNX, but it did not fit into RAM, so I need to convert it to fp16. I tried the optimum optimizer, but it says graph optimization is not supported for GPT-J.
Here is the command I used to convert it:
python -m optimum.exporters.onnx --task causal-lm-with-past --for-ort --model EleutherAI/gpt-j-6B gptj_onnx/
Can anyone help in this regard?
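In case it helps, this is roughly the fp16 conversion I was hoping to apply on top of the export. It is just a sketch: it assumes the onnxconverter-common package and that the export produced gptj_onnx/decoder_model.onnx, which may differ depending on the optimum version.

```python
# Sketch only: convert the exported ONNX graph to fp16 with onnxconverter-common.
# The file name decoder_model.onnx is an assumption; check what the export actually produced.
import onnx
from onnxconverter_common import float16

# onnx.load picks up external data files sitting next to the .onnx file by default.
model = onnx.load("gptj_onnx/decoder_model.onnx")

# keep_io_types=True keeps the model inputs/outputs in fp32 while casting the weights to fp16.
# For models over 2 GB, shape inference inside this call can be problematic, so this may need tweaks.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)

# Save with external data, since even the fp16 GPT-J weights exceed the 2 GB protobuf limit.
onnx.save_model(
    model_fp16,
    "gptj_onnx_fp16/decoder_model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
)
```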
@fxmarty can you help? I got the idea to convert it to ONNX from your answer on this post:
One thing I also noted while doing this: if the model size is 5 GB (e.g. GPT-Neo 1.3B), the converted ONNX model takes up to 2.5 times that in VRAM during inference, which is too high. So if I try to run GPT-J, it takes 50-60 GB of RAM to run inference. Is there a way around this, or am I doing something wrong?
I want to reduce the latency for GPT-J, as it is currently slow even on a GPU when generating 400-500 tokens!
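For context, this is roughly how I am loading and running the exported model on GPU. Again a sketch only: it assumes onnxruntime-gpu is installed and that gptj_onnx/ is the export directory from the command above.

```python
# Sketch only: run the exported GPT-J ONNX model with ONNX Runtime on GPU via optimum.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# provider="CUDAExecutionProvider" requires the onnxruntime-gpu package.
model = ORTModelForCausalLM.from_pretrained("gptj_onnx/", provider="CUDAExecutionProvider")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=400)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```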
Hi @pankajdev007, right, this is not ideal. Currently the memory is duplicated for decoder models, as there is one ONNX graph that does not use the past key/values (for the first decoding iteration) and a second ONNX graph that does use them.
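To illustrate why memory roughly doubles, this is a sketch of what loading the two exported decoders amounts to. The file names decoder_model.onnx and decoder_with_past_model.onnx are assumptions based on what the --for-ort export typically produces and may differ across optimum versions.

```python
# Sketch only: two separate ONNX Runtime sessions, each holding a full copy of the weights.
import onnxruntime as ort

# Used once, for the first decoding step, when there are no past key/values yet.
decoder = ort.InferenceSession(
    "gptj_onnx/decoder_model.onnx", providers=["CUDAExecutionProvider"]
)

# Used for every subsequent step; it consumes and produces past key/values.
decoder_with_past = ort.InferenceSession(
    "gptj_onnx/decoder_with_past_model.onnx", providers=["CUDAExecutionProvider"]
)

# Both sessions load the full 6B parameters, hence roughly 2x the expected memory footprint.
```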