Can inference be done with the model converted to OpenVINO?

#44
by DeltaLux - opened

Inference with the unconverted model is slow on CPU for me, so I'm trying to convert the model to OpenVINO. The conversion appears to succeed and inference is fast after compilation, but I can't decode the output. Each step from my notebook is marked below with an emoji and a note saying whether it succeeded or failed.

[✅ Success] Install Python Libraries

%pip install --upgrade pip
%pip install -q "openvino>=2023.2.0"
%pip install -q torch
%pip install transformers
%pip install einops
%pip install accelerate

[✅ Success] Clone repository

git lfs install 
git clone https://huggingface.co/microsoft/phi-2

[✅ Success] Convert the model

import openvino as ov
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
core = ov.Core()
# Load the model
model = AutoModelForCausalLM.from_pretrained("./phi-2", torch_dtype=torch.float32, device_map="cpu", trust_remote_code=True)
# Create a tokenizer to tokenize a sample for conversion of the model
tokenizer = AutoTokenizer.from_pretrained("./phi-2", return_dict=True)
inputs = tokenizer('Write a detailed analogy between mathematics and a lighthouse.', return_tensors="pt", return_attention_mask=False)
# Convert the model
openvino_model = ov.convert_model(input_model=model, example_input=dict(inputs), verbose=True)
# Save the model
ov.save_model(openvino_model, "./openvino-phi2/openvino-phi2.xml")

[✅ Success] Load the converted model and print inputs and outputs (optional; this shows the input and output shapes and names)

import openvino as ov
from transformers import AutoTokenizer

openvino_model_id = "./openvino-phi2/openvino-phi2.xml"
core = ov.Core()
# Replace GPU.1 with your own device string (for example CPU or GPU); the right value depends on whether you run inference on a dedicated or an integrated GPU
compiled_model = core.compile_model(model=openvino_model_id, device_name="GPU.1")

model_inputs = compiled_model.inputs
model_input = compiled_model.input(0)
model_outputs = compiled_model.outputs
print("Model inputs count:", len(model_inputs))
print("Model input:", model_input)
print("Model outputs count:", len(model_outputs))
print("Model outputs:")
for output in model_outputs:
    print("  ", output)

Output:

Model inputs count: 1
Model input: <ConstOutput: names[input_ids] shape[?,?] type: i64>
Model outputs count: 1
Model outputs:
<ConstOutput: names[5195, 5196, logits] shape[?,?,51200] type: f32>

[❌ Fail] Load OpenVINO model and try to run inference

import openvino as ov
from transformers import AutoTokenizer

openvino_model_id = "./openvino-phi2/openvino-phi2.xml"
original_model_id = "./phi-2"
core = ov.Core()
# Replace GPU.1 with your own device string (for example CPU or GPU); the right value depends on whether you run inference on a dedicated or an integrated GPU
compiled_model = core.compile_model(model=openvino_model_id, device_name="GPU.1")
tokenizer = AutoTokenizer.from_pretrained(original_model_id)
inputs = tokenizer('Write a detailed analogy between mathematics and a lighthouse.', return_tensors="np", return_attention_mask=False)
result = compiled_model.infer_new_request(inputs={"input_ids": inputs["input_ids"]})
text = tokenizer.batch_decode([result])[0]
print(text)

Output (the error is raised by text = tokenizer.batch_decode([result])[0]):

TypeError: argument 'ids': 'openvino._pyopenvino.ConstOutput' object cannot be interpreted as an integer

Are there other steps needed to decode the output or do I need another tokenizer?
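
One possible reading of the TypeError: `result` is not a sequence of token ids. As far as I understand the OpenVINO Python API, `infer_new_request` returns a dict-like object keyed by output ports, and the `logits` entry holds raw scores of shape [batch, sequence, vocab]. The sketch below shows how such scores could be reduced to a token id with greedy (argmax) selection. It is only an illustration: `next_token_id` is a hypothetical helper, and `fake_logits` is a made-up stand-in array, not real model output.

```python
import numpy as np


def next_token_id(logits: np.ndarray) -> int:
    """Return the highest-scoring token id for the position after the prompt.

    `logits` is assumed to have shape [batch, sequence, vocab].
    """
    # Only the last sequence position predicts the token that follows the prompt.
    return int(np.argmax(logits[0, -1, :]))


# Stand-in for the model's logits output (assumption: same shape as reported
# in the thread, [1, 10, 51200]); the values here are invented.
fake_logits = np.zeros((1, 10, 51200), dtype=np.float32)
fake_logits[0, -1, 1234] = 5.0  # pretend token 1234 scores highest

token_id = next_token_id(fake_logits)
print(token_id)
```

With the real model, the array would presumably be read with something like `result["logits"]` or `result[compiled_model.output(0)]`, and `tokenizer.decode([token_id])` would turn the id into text. Note that this yields a single next token, not a full completion.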

@DeltaLux Were you able to decode the output? I would also like to use openvino for inference.

No success. I tried various ONNX configurations to see whether I could convert the model to ONNX instead, but none of them works. The model converted to OpenVINO format outputs an OpenVINO Tensor with shape [1, 10, 51200] and type f32; the f32 type comes from a cast operation at the end of the network. The original output from the HF library is a Torch tensor with shape [1, 20] and type torch.int64, unless the max_length property is changed, in which case the tensor has as many elements as that property specifies. I haven't yet tried creating a different tokenizer with its own encoder and decoder.
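
The shape difference described above may come from the fact that the HF `generate` call runs a decoding loop and returns token ids, whereas the converted network performs a single forward pass and returns per-position scores over the 51200-entry vocabulary. A minimal sketch of such a loop, assuming plain greedy decoding with no end-of-sequence handling and no key/value cache, is given here. `greedy_generate` and `toy_model` are hypothetical names; `toy_model` is an invented stand-in so the example runs without OpenVINO.

```python
from typing import Callable

import numpy as np


def greedy_generate(
    infer_logits: Callable[[np.ndarray], np.ndarray],
    input_ids: np.ndarray,
    max_length: int = 20,
) -> np.ndarray:
    """Append one argmax token per step until the sequence has max_length tokens."""
    ids = input_ids.astype(np.int64)
    while ids.shape[1] < max_length:
        logits = infer_logits(ids)  # expected shape: [1, seq, vocab]
        next_id = np.argmax(logits[:, -1, :], axis=-1)  # shape: [1]
        ids = np.concatenate([ids, next_id[:, None].astype(np.int64)], axis=1)
    return ids


VOCAB = 50  # tiny made-up vocabulary for the stand-in model


def toy_model(ids: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: always predicts (current token + 1) mod VOCAB."""
    logits = np.zeros((1, ids.shape[1], VOCAB), dtype=np.float32)
    logits[0, np.arange(ids.shape[1]), (ids[0] + 1) % VOCAB] = 1.0
    return logits


prompt = np.array([[3, 4, 5]], dtype=np.int64)
generated = greedy_generate(toy_model, prompt, max_length=8)
print(generated.shape)
```

With the converted model, `infer_logits` could plausibly be `lambda ids: compiled_model.infer_new_request({"input_ids": ids})["logits"]` (an assumption about the result object's string indexing), and the returned ids could be passed to `tokenizer.batch_decode`. Because the whole sequence is re-run at every step, this would be slow for long outputs.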

We might have to wait for an ONNX configuration to be added to the HF Optimum library. I tried to contribute one myself, but none of my test configurations has produced a working ONNX model so far.
