api-inference documentation

Parameters


Additional Options

Caching

The Inference API has a cache layer that speeds up requests whose inputs are exactly the same as those of an earlier request. For deterministic models, such as classifiers and embedding models, the cached result can be used as is, because the model would return the same output anyway. If you use a nondeterministic model, however, you can disable the cache so that each request runs a new query.

To do this, add x-use-cache:false to the request headers. For example:

Python
import requests

# Replace MODEL_ID with the ID of the model you want to query.
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
    "x-use-cache": "false",  # bypass the cache and run a new query
}
data = {
    "inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
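
If only some calls need fresh results, the header can be switched on per request instead of being hard-coded. The following is a minimal sketch; the helper name build_headers and its use_cache parameter are hypothetical and not part of the API, which only sees the resulting x-use-cache header.

```python
def build_headers(token, use_cache=True):
    """Return request headers, adding x-use-cache:false when caching is unwanted.

    `build_headers` and `use_cache` are illustrative names, not part of the API.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    if not use_cache:
        # Only send the header when the cache should be bypassed;
        # leaving it out keeps the default cached behavior.
        headers["x-use-cache"] = "false"
    return headers
```

The returned dictionary can be passed as the headers argument of requests.post, for example build_headers("hf_***", use_cache=False) for a nondeterministic model.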

Wait for the model

When a model is warm, it is ready to be used and you will get a response relatively quickly. However, some models are cold and need to be loaded before they can be used. In that case, you will get a 503 error. Rather than sending repeated requests until the model is loaded, you can wait for it by adding x-wait-for-model:true to the request headers. We suggest using this flag only when you are sure that the model is cold: first try the request without the flag, and only if you get a 503 error, try again with it.

Python
import requests

# Replace MODEL_ID with the ID of the model you want to query.
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
    "x-wait-for-model": "true",  # wait for a cold model to load instead of returning a 503
}
data = {
    "inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
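
The suggested order (try without the flag first, then retry with it only after a 503) can be sketched as follows. This is an illustration under assumptions rather than an official client: the function name query_with_cold_start_fallback and its post parameter are hypothetical, and MODEL_ID is a placeholder. The post parameter exists only so the logic can be tried with a stub instead of a real network call; by default it falls back to requests.post.

```python
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"  # MODEL_ID is a placeholder


def query_with_cold_start_fallback(payload, token, post=None):
    """POST once without the flag; on a 503, retry once with x-wait-for-model:true."""
    if post is None:
        # Default transport. Imported here so a stub can be injected
        # without requiring the requests package.
        import requests
        post = requests.post

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    response = post(API_URL, headers=headers, json=payload)
    if response.status_code == 503:
        # The model is cold: repeat the same request, this time asking
        # the API to wait until the model has been loaded.
        retry_headers = {**headers, "x-wait-for-model": "true"}
        response = post(API_URL, headers=retry_headers, json=payload)
    return response
```

A call such as query_with_cold_start_fallback({"inputs": "..."}, "hf_***") sends at most two requests. Any status other than 503 is returned to the caller unchanged, so error handling for other failures is left to the calling code.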