Parameters
Additional Options
Caching
The inference API has a cache layer that speeds up requests when the inputs are exactly the same. Many models, such as classifiers and embedding models, are deterministic: they return the same results for the same inputs, so they can use cached results as is. However, if you use a nondeterministic model, you can disable the cache so that each request runs a new query.
To do this, add x-use-cache:false to the request headers. For example:
import requests

# Replace MODEL_ID with the ID of the model you want to query.
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
    "x-use-cache": "false",  # bypass the cache and run a new query
}
data = {
    "inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
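Disabling the cache matters most when you want several independent outputs from a nondeterministic model for the same input. The sketch below is one possible way to do that, not part of the API itself: the helper name fresh_queries and its post parameter are hypothetical, introduced only so the function can be exercised with a stub instead of a live request.

```python
from typing import Any, Callable, Dict, List, Optional


def fresh_queries(
    api_url: str,
    headers: Dict[str, str],
    payload: Dict[str, Any],
    n: int,
    post: Optional[Callable[..., Any]] = None,
) -> List[Any]:
    """Send the same payload n times with the cache disabled.

    Hypothetical helper: `post` defaults to requests.post and exists only
    so that a stub can be injected for testing.
    """
    if post is None:
        import requests  # imported lazily so a stub can replace it

        post = requests.post

    # Copy the caller's headers and add the cache-bypass header.
    no_cache_headers = {**headers, "x-use-cache": "false"}

    results = []
    for _ in range(n):
        response = post(api_url, headers=no_cache_headers, json=payload)
        results.append(response.json())
    return results
```

With the cache left on, identical requests could all be served the same stored result, so repeating the call would not produce new samples.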
Wait for the model
When a model is warm, it is ready to be used and you will get a response relatively quickly. However, some models are cold and need to be loaded before they can be used. In that case, you will get a 503 error. Rather than sending many requests until the model is loaded, you can wait for it to load by adding x-wait-for-model:true to the request headers. We suggest using this flag only when you are sure that the model is cold: first try the request without the flag, and only if you get a 503 error, try again with it.
import requests

# Replace MODEL_ID with the ID of the model you want to query.
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
    "x-wait-for-model": "true",  # wait for a cold model to load instead of returning 503
}
data = {
    "inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
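The suggested pattern (try without the flag first, then retry with it only on a 503) can be sketched as a small wrapper. This is a minimal illustration under assumptions: the function name query_with_cold_start_retry and the post parameter are hypothetical, and the sketch retries exactly once without handling other error codes or timeouts.

```python
from typing import Any, Callable, Dict, Optional


def query_with_cold_start_retry(
    api_url: str,
    headers: Dict[str, str],
    payload: Dict[str, Any],
    post: Optional[Callable[..., Any]] = None,
) -> Any:
    """Send a request; if the model is cold (HTTP 503), retry once
    with x-wait-for-model set to true.

    Hypothetical helper: `post` defaults to requests.post and exists only
    so that a stub can be injected for testing.
    """
    if post is None:
        import requests  # imported lazily so a stub can replace it

        post = requests.post

    # First attempt: no wait flag, so a warm model answers right away.
    response = post(api_url, headers=headers, json=payload)

    if response.status_code == 503:
        # The model is cold: ask the API to hold the request until it loads.
        retry_headers = {**headers, "x-wait-for-model": "true"}
        response = post(api_url, headers=retry_headers, json=payload)

    return response
```

Called as query_with_cold_start_retry(API_URL, headers, data), with headers that do not already contain x-wait-for-model, it returns the response object, so response.json() can be read as in the example above.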