Optimum Inference with ONNX Runtime
Optimum is a utility package for building and running inference with accelerated runtimes like ONNX Runtime. Optimum can be used to load optimized models from the Hugging Face Hub and create pipelines to run accelerated inference without rewriting your APIs.
Switching from Transformers to Optimum Inference
The Optimum Inference models are API-compatible with Hugging Face Transformers models. This means you can simply replace your AutoModelForXxx class with the corresponding ORTModelForXxx class in optimum. For example, this is how you can use a question answering model in optimum:
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering
-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass from_transformers=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on the fly:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> # Load the model from the hub and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> # Create a pipeline
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> result = onnx_classifier("This is a great model")
>>> result
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]
Working with the Hugging Face Model Hub
The Optimum model classes like ORTModelForSequenceClassification are integrated with the Hugging Face Model Hub, which means you can not only load models from the Hub but also push your models to the Hub with the push_to_hub() method. Below is an example that downloads a vanilla Transformers model from the Hub, converts it to an ONNX Runtime model with optimum, and pushes it to a new repository.
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> # Load the model from the hub and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> # Save the converted model
>>> model.save_pretrained("a_local_path_for_convert_onnx_model")
>>> tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")
>>> # Push the ONNX model to the HF Hub
>>> model.push_to_hub("a_local_path_for_convert_onnx_model", repository_id="my-onnx-repo", use_auth_token=True)
Export and inference of sequence-to-sequence models
Sequence-to-sequence (Seq2Seq) models, which generate a new sequence from an input, can also be used when running inference with ONNX Runtime. When Seq2Seq models are exported to the ONNX format, they are decomposed into three parts that are later combined during inference: the encoder, the “decoder” (which actually consists of the decoder together with the language modeling head), and the “decoder” with pre-computed key/values as additional inputs. The export is split this way because during the first pass the decoder has no pre-computed key/value hidden states, while during the rest of the generation the past key/values are reused to speed up sequential decoding. Here is an example of how you can export a T5 model to the ONNX format and run inference for a translation task:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
>>> # Load the model from the hub and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
>>> # Create a pipeline
>>> onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
>>> result = onnx_translation("My name is Eustache")
>>> result
[{'translation_text': 'Mon nom est Eustache'}]
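To make the three-part split described above concrete, here is a minimal, framework-free sketch of a greedy decoding loop. The functions `encoder`, `decoder` and `decoder_with_past` are hypothetical toy stand-ins for the three exported ONNX graphs (they do integer arithmetic instead of running a network); the sketch only shows which part runs on which step and how the key/value cache is threaded through.

```python
from typing import List, Optional, Tuple

# Hypothetical toy stand-ins for the three exported ONNX graphs. They do integer
# arithmetic instead of running a network; a "cache" is simply the list of decoder
# tokens already processed.


def encoder(input_ids: List[int]) -> List[int]:
    # Toy "hidden states": one value per input token.
    return [2 * t for t in input_ids]


def decoder(prefix: List[int], encoder_hidden: List[int]) -> Tuple[int, List[int]]:
    # First pass: no past key/values exist, so the whole decoder prefix is
    # processed and the cache is built from scratch.
    next_token = (sum(encoder_hidden) + sum(prefix)) % 10
    return next_token, list(prefix)


def decoder_with_past(last_token: int, encoder_hidden: List[int],
                      past: List[int]) -> Tuple[int, List[int]]:
    # Later passes: only the newest token is fed in; earlier work is reused via `past`.
    next_token = (sum(encoder_hidden) + sum(past) + last_token) % 10
    return next_token, past + [last_token]


def greedy_generate(input_ids: List[int], start_token: int = 0,
                    max_new_tokens: int = 4) -> Tuple[List[int], Optional[List[int]]]:
    encoder_hidden = encoder(input_ids)  # the encoder runs exactly once
    generated = [start_token]
    past: Optional[List[int]] = None
    for _ in range(max_new_tokens):
        if past is None:
            next_token, past = decoder(generated, encoder_hidden)
        else:
            next_token, past = decoder_with_past(generated[-1], encoder_hidden, past)
        generated.append(next_token)
    return generated, past
```

With the cache, `greedy_generate([1, 2, 3])` produces the same tokens as recomputing the full prefix with `decoder` at every step; only the amount of work per step differs, which is the property the real export relies on.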
ORTModel
Base ORTModel class for implementing models using ONNX Runtime. The ORTModel implements generic methods for interacting with the Hugging Face Hub as well as for exporting vanilla Transformers models to ONNX using the transformers.onnx toolchain.
The ORTModel additionally implements generic methods for optimizing and quantizing ONNX models.
load_model(path: typing.Union[str, pathlib.Path], provider=None)
Loads an ONNX Runtime InferenceSession with the given provider. The default provider is CPUExecutionProvider, to match the default behaviour in PyTorch/TensorFlow/JAX.
Changes the ONNX Runtime provider according to the device.
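As a rough illustration of provider selection, the sketch below maps a torch-style device string to an ONNX Runtime execution provider name and falls back to the CPU default when the requested provider is not available. Both helpers (`provider_for_device`, `choose_provider`) are hypothetical and not part of optimum; in real code the `available` list would come from `onnxruntime.get_available_providers()`, and the chosen name would be passed as `providers=[...]` to `onnxruntime.InferenceSession`.

```python
from typing import Optional, Sequence

# Hypothetical helpers (not part of optimum) illustrating provider selection.
DEFAULT_PROVIDER = "CPUExecutionProvider"


def provider_for_device(device: str) -> str:
    """Map a torch-style device string ("cpu", "cuda", "cuda:0") to a provider name."""
    kind = device.split(":")[0].lower()
    if kind == "cuda":
        return "CUDAExecutionProvider"
    if kind == "cpu":
        return DEFAULT_PROVIDER
    raise ValueError(f"Unsupported device: {device!r}")


def choose_provider(requested: Optional[str], available: Sequence[str]) -> str:
    """Return the requested provider if ONNX Runtime reports it as available,
    otherwise fall back to the CPU default."""
    if requested is not None and requested in available:
        return requested
    return DEFAULT_PROVIDER
```

Falling back silently rather than raising is a design choice of this sketch; raising an error may be preferable when running on CPU by accident would be costly.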
ORTModelForFeatureExtraction
class optimum.onnxruntime.ORTModelForFeatureExtraction(model=None, config=None, **kwargs)
Parameters
- config (transformers.PretrainedConfig): PretrainedConfig is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession): onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX model with a BaseModelOutput (exposing last_hidden_state) for feature-extraction tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Feature Extraction model for ONNX.
forward(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, token_type_ids: typing.Optional[torch.Tensor] = None, **kwargs)
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B.
  What are token type IDs?
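To see how the three inputs line up, here is a framework-free sketch that batches already-tokenized sentence pairs with right padding. The token ids are made up and no special tokens ([CLS], [SEP]) are added, unlike a real tokenizer; the helper `encode_pairs` is illustrative only.

```python
from typing import Dict, List, Tuple


def encode_pairs(pairs: List[Tuple[List[int], List[int]]],
                 pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Concatenate (sentence A, sentence B) token-id pairs and right-pad them into a batch.

    The ids are assumed to be already tokenized; no special tokens are added.
    """
    rows = [a + b for a, b in pairs]
    max_len = max(len(r) for r in rows)
    input_ids, attention_mask, token_type_ids = [], [], []
    for (a, b), row in zip(pairs, rows):
        pad = max_len - len(row)
        input_ids.append(row + [pad_id] * pad)
        # 1 = real token (attended to), 0 = padding (masked out)
        attention_mask.append([1] * len(row) + [0] * pad)
        # 0 = sentence A, 1 = sentence B; padding positions get 0 by convention
        token_type_ids.append([0] * len(a) + [1] * len(b) + [0] * pad)
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids}
```

For example, `encode_pairs([([5, 6], [7]), ([8], [9, 10, 11])])` pads the first row by one position, so its attention mask ends in a 0 while the second row is fully attended.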
The ORTModelForFeatureExtraction forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of feature extraction:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> list(last_hidden_state.shape)
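The feature-extraction model returns one hidden-state vector per token. To get a single sentence embedding, sentence-transformers checkpoints such as all-MiniLM-L6-v2 are commonly paired with mean pooling over the non-padding tokens; this sketch assumes that recipe applies to the checkpoint you use. The helper `mean_pool` is illustrative and works on plain nested lists; with PyTorch tensors the same computation is `(h * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)`.

```python
from typing import List


def mean_pool(last_hidden_state: List[List[List[float]]],
              attention_mask: List[List[int]]) -> List[List[float]]:
    """Average the token vectors of each sequence, ignoring padded positions (mask == 0)."""
    pooled = []
    for tokens, mask in zip(last_hidden_state, attention_mask):
        hidden_size = len(tokens[0])
        total = [0.0] * hidden_size
        count = 0
        for vec, m in zip(tokens, mask):
            if m:  # only real tokens contribute
                total = [t + v for t, v in zip(total, vec)]
                count += 1
        pooled.append([t / max(count, 1) for t in total])
    return pooled
```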
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForFeatureExtraction
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
>>> onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_extractor(text)
ORTModelForQuestionAnswering
class optimum.onnxruntime.ORTModelForQuestionAnswering(model=None, config=None, **kwargs)
Parameters
- config (transformers.PretrainedConfig): PretrainedConfig is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession): onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX model with a QuestionAnsweringModelOutput for extractive question-answering tasks like SQuAD.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Question Answering model for ONNX.
forward(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, token_type_ids: typing.Optional[torch.Tensor] = None, **kwargs)
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B.
  What are token type IDs?
The ORTModelForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of question answering:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])
>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
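The question-answering model returns one start score and one end score per token. A simple way to pick an answer is to choose the span (start, end) with start <= end and a bounded length that maximizes start_logit + end_logit. The sketch below does exactly that on plain lists; the question-answering pipeline does more (for example it excludes question tokens and maps tokens back to character offsets), and the helper name `best_span` and its `max_answer_len` default are assumptions.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int]:
    """Return the (start, end) token indices, with start <= end and at most
    max_answer_len tokens, that maximize start_logits[start] + end_logits[end]."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best
```

Applied to the example above, `best_span(outputs.start_logits[0].tolist(), outputs.end_logits[0].tolist())` gives token indices that can be turned back into text with `tokenizer.decode(inputs["input_ids"][0][start : end + 1])`.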
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> pred = onnx_qa(question, text)
ORTModelForSequenceClassification
class optimum.onnxruntime.ORTModelForSequenceClassification(model=None, config=None, **kwargs)
Parameters
- config (transformers.PretrainedConfig): PretrainedConfig is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession): onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX model with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Sequence Classification model for ONNX.
forward(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, token_type_ids: typing.Optional[torch.Tensor] = None, **kwargs)
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B.
  What are token type IDs?
The ORTModelForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of single-label classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
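For single-label classification the logits have shape (batch_size, num_labels); turning one row into a prediction is a softmax followed by an argmax, with the label name looked up in `model.config.id2label`. A framework-free sketch (the helper names are illustrative, and the label mapping in the usage line below is assumed rather than read from the checkpoint):

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label name and its probability for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]
```

For instance, `predict_label([-1.0, 3.0], {0: "NEGATIVE", 1: "POSITIVE"})` returns `("POSITIVE", ≈0.982)`.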
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
>>> onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> text = "Hello, my dog is cute"
>>> pred = onnx_classifier(text)
Example using the zero-shot-classification transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-mnli")
>>> onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
>>> sequence_to_classify = "Who are you voting for in 2020?"
>>> candidate_labels = ["Europe", "public health", "politics", "elections"]
>>> pred = onnx_z0(sequence_to_classify, candidate_labels, multi_label=True)
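Under the hood, the zero-shot-classification pipeline turns each candidate label into an NLI hypothesis (by default "This example is {}.") and scores the (sequence, hypothesis) pair with the sequence-classification model. The sketch below shows only the final scoring step on made-up NLI logits; which index means "entailment" or "contradiction" depends on the checkpoint's `config.label2id`, so both are passed in explicitly. The helper `zero_shot_scores` is illustrative.

```python
import math
from typing import Dict, List


def _softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(nli_logits: Dict[str, List[float]], entail_idx: int,
                     contra_idx: int, multi_label: bool) -> Dict[str, float]:
    """nli_logits maps each candidate label to the NLI logits of (sequence, hypothesis for that label)."""
    if multi_label:
        # Each label is scored on its own: softmax over [contradiction, entailment], keep entailment.
        return {
            label: _softmax([logits[contra_idx], logits[entail_idx]])[1]
            for label, logits in nli_logits.items()
        }
    # Single label: softmax over the entailment logits of all candidate labels.
    labels = list(nli_logits)
    probs = _softmax([nli_logits[label][entail_idx] for label in labels])
    return dict(zip(labels, probs))
```

With multi_label=True the scores do not need to sum to 1, so several labels (here, "politics" and "elections") can both score high; without it the labels compete for a single probability mass.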
ORTModelForTokenClassification
class optimum.onnxruntime.ORTModelForTokenClassification(model=None, config=None, **kwargs)
Parameters
- config (transformers.PretrainedConfig): PretrainedConfig is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession): onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Token Classification model for ONNX.
forward(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, token_type_ids: typing.Optional[torch.Tensor] = None, **kwargs)
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B.
  What are token type IDs?
The ORTModelForTokenClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of token classification:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
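The token-classification model returns logits of shape (batch_size, sequence_length, num_labels). A minimal, framework-free sketch of turning one sequence's logits into entity spans: take the argmax per token, map ids through an id2label dictionary (in practice `model.config.id2label`), then merge B-/I- tag runs. The label names assume a BIO scheme (B-PER, I-PER, ...), which NER checkpoints such as bert-base-NER commonly use; the helpers are hypothetical and, unlike the token-classification pipeline, they neither merge WordPiece sub-tokens nor drop special tokens.

```python
from typing import Dict, List, Tuple


def tag_tokens(logits: List[List[float]], id2label: Dict[int, str]) -> List[str]:
    """Pick the highest-scoring label for each token (logits: sequence_length x num_labels)."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in logits]


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-XXX / I-XXX runs into (entity_type, text) spans; 'O' tokens are skipped."""
    entities: List[Tuple[str, List[str]]] = []
    prev_type = None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            prev_type = None
            continue
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and prev_type == ent_type:
            entities[-1][1].append(tok)  # continue the current entity
        else:
            entities.append((ent_type, [tok]))  # start a new entity
        prev_type = ent_type
    return [(ent_type, " ".join(words)) for ent_type, words in entities]
```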
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForTokenClassification
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER")
>>> model = ORTModelForTokenClassification.from_pretrained("optimum/bert-base-NER")
>>> onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> pred = onnx_ner(text)
ORTModelForCausalLM
class optimum.onnxruntime.ORTModelForCausalLM(model=None, config=None, **kwargs)
Parameters
- config (transformers.PretrainedConfig): PretrainedConfig is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession): onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX model with a causal language modeling head on top (a linear layer with weights tied to the input embeddings).
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Causal LM model for ONNX.
forward(input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, **kwargs)
Parameters
- input_ids (torch.Tensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  What are attention masks?
- token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 for tokens that are sentence A,
  - 1 for tokens that are sentence B.
  What are token type IDs?
The ORTModelForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of text generation:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> inputs = tokenizer("My name is Philipp and I live in Germany.", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs, do_sample=True, temperature=0.9, min_length=20, max_length=20)
>>> tokenizer.batch_decode(gen_tokens)
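With do_sample=True, generate() samples each next token instead of taking the argmax, and temperature=0.9 divides the logits by 0.9 before the softmax, which slightly sharpens the distribution. A minimal sketch of that single sampling step (the helper `sample_next_token` is hypothetical; generate() may also apply other logits processors, for example top-k filtering, which are omitted here):

```python
import math
import random
from typing import List, Optional


def sample_next_token(logits: List[float], temperature: float = 1.0,
                      rng: Optional[random.Random] = None) -> int:
    """Sample one token id from a single position's logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them itself.
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

As the temperature approaches 0 the sampling becomes effectively greedy; values above 1 flatten the distribution and make less likely tokens more frequent.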
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
>>> model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")
>>> onnx_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> text = "My name is Philipp and I live in Germany."
>>> gen = onnx_gen(text)
ORTModelForSeq2SeqLM
Sequence-to-sequence model with a language modeling head for ONNX Runtime inference.
forward(input_ids: LongTensor = None, attention_mask: typing.Optional[torch.FloatTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, **kwargs)
Parameters
- input_ids (torch.LongTensor): Indices of input sequence tokens in the vocabulary, of shape (batch_size, encoder_sequence_length).
- attention_mask (torch.LongTensor): Mask to avoid performing attention on padding token indices, of shape (batch_size, encoder_sequence_length). Mask values selected in [0, 1].
- decoder_input_ids (torch.LongTensor): Indices of decoder input sequence tokens in the vocabulary, of shape (batch_size, decoder_sequence_length).
- encoder_outputs (torch.FloatTensor): The encoder last_hidden_state of shape (batch_size, encoder_sequence_length, hidden_size).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional): Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. The tuple is of length config.n_layers, with each inner tuple having 2 tensors of shape (batch_size, num_heads, decoder_sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
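The nesting of past_key_values is easy to get wrong, so here is a sketch that builds only the expected shapes per layer, assuming the usual Transformers ordering within each layer tuple (self-attention key, self-attention value, cross-attention key, cross-attention value). The helper `past_key_value_shapes` is hypothetical, and the numbers in the test assume t5-small's configuration (6 layers, 8 heads, 64-dimensional heads).

```python
from typing import List, Tuple

Shape = Tuple[int, int, int, int]


def past_key_value_shapes(n_layers: int, batch_size: int, num_heads: int,
                          decoder_len: int, encoder_len: int,
                          head_dim: int) -> List[Tuple[Shape, Shape, Shape, Shape]]:
    """Expected past_key_values shapes for an encoder-decoder model, one 4-tuple per layer:
    (self-attn key, self-attn value, cross-attn key, cross-attn value)."""
    self_attn = (batch_size, num_heads, decoder_len, head_dim)
    cross_attn = (batch_size, num_heads, encoder_len, head_dim)
    return [(self_attn, self_attn, cross_attn, cross_attn) for _ in range(n_layers)]
```

Only the two self-attention entries grow by one position per generated token; the cross-attention entries depend on the encoder length and stay fixed for the whole generation.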
The ORTModelForSeq2SeqLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of text generation:
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
>>> inputs = tokenizer("My name is Eustache and I like to", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs)
>>> outputs = tokenizer.batch_decode(gen_tokens)
Example using transformers.pipeline:
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
>>> onnx_translation = pipeline("translation_en_to_de", model=model, tokenizer=tokenizer)
>>> text = "My name is Eustache."
>>> pred = onnx_translation(text)
ORTModelForImageClassification
class optimum.onnxruntime.ORTModelForImageClassification(model=None, config=None, **kwargs)
Parameters
- config (transformers.PretrainedConfig): PretrainedConfig is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- model (onnxruntime.InferenceSession): onnxruntime.InferenceSession is the main class used to run a model. Check out the load_model() method for more information.
ONNX model for image-classification tasks.
This model inherits from ORTModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
Image Classification model for ONNX.
forward(pixel_values: Tensor, **kwargs)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, height, width)): Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using AutoFeatureExtractor.
The ORTModelForImageClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of image classification:
>>> import requests
>>> from PIL import Image
>>> from optimum.onnxruntime import ORTModelForImageClassification
>>> from transformers import AutoFeatureExtractor
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
>>> model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")
>>> inputs = preprocessor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
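pixel_values must have the layout (batch_size, num_channels, height, width). A framework-free sketch of roughly what the preprocessor does to one RGB image: rescale 0-255 values to [0, 1], normalize each channel, and move channels first. The mean and std of 0.5 are an assumption (values commonly used by ViT image processors); the real AutoFeatureExtractor reads them from the checkpoint and also resizes the image to the model's input resolution, which this hypothetical helper skips.

```python
from typing import List, Sequence


def to_pixel_values(image_hwc: List[List[List[int]]],
                    mean: Sequence[float] = (0.5, 0.5, 0.5),
                    std: Sequence[float] = (0.5, 0.5, 0.5)) -> List[List[List[List[float]]]]:
    """Convert one RGB image (height x width x 3, values 0-255) into a batch of size 1
    laid out as (batch_size, num_channels, height, width), rescaled and normalized."""
    height, width = len(image_hwc), len(image_hwc[0])
    channels = []
    for c in range(3):
        channels.append([
            [(image_hwc[y][x][c] / 255.0 - mean[c]) / std[c] for x in range(width)]
            for y in range(height)
        ])
    return [channels]  # add the batch dimension
```

The model's output logits then have shape (batch_size, num_labels), and the predicted class name is `model.config.id2label[logits.argmax(-1).item()]`.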
Example using transformers.pipeline:
>>> import requests
>>> from PIL import Image
>>> from transformers import AutoFeatureExtractor, pipeline
>>> from optimum.onnxruntime import ORTModelForImageClassification
>>> preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
>>> model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")
>>> onnx_image_classifier = pipeline("image-classification", model=model, feature_extractor=preprocessor)
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> pred = onnx_image_classifier(url)