Hi, I want to convert the GPT-J model to ONNX to improve the inference speed. I tried to convert the model to ONNX, but it did not fit into RAM, so I need to convert it to fp16. I tried the optimum optimizer, but it says graph optimization is not supported for GPT-J.
Here is the command I used to convert it:
python -m optimum.exporters.onnx --task causal-lm-with-past --for-ort --model EleutherAI/gpt-j-6B gptj_onnx/
Can anyone help in this regard?
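In case it helps, this is roughly the fp16 conversion I was hoping to apply on top of the export. It is just a sketch: it assumes the onnxconverter-common package and that the export produced gptj_onnx/decoder_model.onnx, which may differ depending on the optimum version.

```python
# Sketch only: convert the exported ONNX graph to fp16 with onnxconverter-common.
# The file name decoder_model.onnx is an assumption; check what the export actually produced.
import onnx
from onnxconverter_common import float16

# onnx.load picks up external data files sitting next to the .onnx file by default.
model = onnx.load("gptj_onnx/decoder_model.onnx")

# keep_io_types=True keeps the model inputs/outputs in fp32 while casting the weights to fp16.
# For models over 2 GB, shape inference inside this call can be problematic, so this may need tweaks.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)

# Save with external data, since even the fp16 GPT-J weights exceed the 2 GB protobuf limit.
onnx.save_model(
    model_fp16,
    "gptj_onnx_fp16/decoder_model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
)
```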
@fxmarty can you help? I got the idea to convert it to ONNX from your answer on this post:
One thing I also noted while doing this: if the model size is 5 GB (e.g. GPT-Neo 1.3B), the converted ONNX model takes up to 2.5 times that in VRAM during inference, which is too high. So if I try to run GPT-J, it takes 50-60 GB of RAM to run inference. Is there a way around this, or am I doing something wrong?
I want to reduce the latency for GPT-J, as it is currently slow even on a GPU when generating 400-500 tokens!
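For context, this is roughly how I am loading and running the exported model on GPU. Again a sketch only: it assumes onnxruntime-gpu is installed and that gptj_onnx/ is the export directory from the command above.

```python
# Sketch only: run the exported GPT-J ONNX model with ONNX Runtime on GPU via optimum.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# provider="CUDAExecutionProvider" requires the onnxruntime-gpu package.
model = ORTModelForCausalLM.from_pretrained("gptj_onnx/", provider="CUDAExecutionProvider")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=400)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```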
Hi @pankajdev007, right, this is not ideal. Currently the memory is duplicated for decoder models, as there is one ONNX graph that does not use the past key/values (for the first decoding iteration) and a second ONNX graph that does use them.
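To illustrate why memory roughly doubles, this is a sketch of what loading the two exported decoders amounts to. The file names decoder_model.onnx and decoder_with_past_model.onnx are assumptions based on what the --for-ort export typically produces and may differ across optimum versions.

```python
# Sketch only: two separate ONNX Runtime sessions, each holding a full copy of the weights.
import onnxruntime as ort

# Used once, for the first decoding step, when there are no past key/values yet.
decoder = ort.InferenceSession(
    "gptj_onnx/decoder_model.onnx", providers=["CUDAExecutionProvider"]
)

# Used for every subsequent step; it consumes and produces past key/values.
decoder_with_past = ort.InferenceSession(
    "gptj_onnx/decoder_with_past_model.onnx", providers=["CUDAExecutionProvider"]
)

# Both sessions load the full 6B parameters, hence roughly 2x the expected memory footprint.
```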