Text Generation
Generate text based on a prompt.
If you are interested in a Chat Completion task, which generates a response based on a list of messages, check out the chat-completion task.
For more details about the text-generation task, check out its dedicated page! You will find examples and related materials.
Recommended models
- google/gemma-2-2b-it: A text-generation model trained to follow instructions.
- meta-llama/Meta-Llama-3.1-8B-Instruct: Very powerful text generation model trained to follow instructions.
- microsoft/Phi-3-mini-4k-instruct: Small yet powerful text generation model.
- Qwen/Qwen2.5-7B-Instruct: Strong text generation model trained to follow instructions.
Explore all available models and find the one that suits you best here.
Using the API
import requests

# Endpoint of the model to query
API_URL = "https://api-inference.huggingface.co/models/google/gemma-2-2b-it"
# Replace hf_*** with your own user access token
headers = {"Authorization": "Bearer hf_***"}

def query(payload):
    # POST the JSON payload and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})
To use the Python client, see huggingface_hub’s package reference.
API specification
Request
Payload | | |
---|---|---|
inputs* | string | |
parameters | object | |
parameters.adapter_id | string | LoRA adapter id. |
parameters.best_of | integer | Generate best_of sequences and return the one with the highest token logprobs. |
parameters.decoder_input_details | boolean | Whether to return decoder input token logprobs and ids. |
parameters.details | boolean | Whether to return generation details. |
parameters.do_sample | boolean | Activate logits sampling. |
parameters.frequency_penalty | number | The parameter for frequency penalty. 1.0 means no penalty. Penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. |
parameters.grammar | unknown | One of the following: |
parameters.grammar (#1) | object | |
parameters.grammar (#1).type* | enum | Possible values: json. |
parameters.grammar (#1).value* | unknown | A string that represents a JSON Schema. JSON Schema is a declarative language for annotating JSON documents with types and descriptions. |
parameters.grammar (#2) | object | |
parameters.grammar (#2).type* | enum | Possible values: regex. |
parameters.grammar (#2).value* | string | |
parameters.max_new_tokens | integer | Maximum number of tokens to generate. |
parameters.repetition_penalty | number | The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. |
parameters.return_full_text | boolean | Whether to prepend the prompt to the generated text. |
parameters.seed | integer | Random sampling seed. |
parameters.stop | string[] | Stop generating tokens if a member of stop is generated. |
parameters.temperature | number | The value used to modulate the logits distribution. |
parameters.top_k | integer | The number of highest probability vocabulary tokens to keep for top-k-filtering. |
parameters.top_n_tokens | integer | The number of highest probability vocabulary tokens to keep for top-n-filtering. |
parameters.top_p | number | Top-p value for nucleus sampling. |
parameters.truncate | integer | Truncate input tokens to the given size. |
parameters.typical_p | number | Typical decoding mass. See Typical Decoding for Natural Language Generation for more information. |
parameters.watermark | boolean | Watermarking with A Watermark for Large Language Models. |
stream | boolean | |
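As a rough illustration of how the payload fields fit together, the sketch below assembles a request body with generation options nested under parameters, while inputs and stream stay at the top level. The helper name build_payload, the example schema, and the chosen option values are hypothetical and only serve the example. Following the table's description of the json grammar value, the schema is serialized to a string here.

```python
import json

def build_payload(prompt, stream=False, **parameters):
    """Assemble a request body (hypothetical helper, not part of any library).

    `inputs` and `stream` sit at the top level; every other keyword
    argument is placed under `parameters`.
    """
    payload = {"inputs": prompt, "stream": stream}
    if parameters:
        payload["parameters"] = parameters
    return payload

# Plain sampling request using a few options from the table.
payload = build_payload(
    "Can you please let us know more details about your ",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    stop=["\n\n"],
    return_full_text=False,
)

# Constrained request: an illustrative JSON Schema passed through `grammar`.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}
constrained = build_payload(
    "Name a city in France as JSON: ",
    max_new_tokens=30,
    grammar={"type": "json", "value": json.dumps(schema)},
)

print(json.dumps(payload, indent=2))
```

Either dictionary can be passed as the payload argument of the query function shown under "Using the API".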
Some options can be configured by passing headers to the Inference API. Here are the available headers:
Headers | ||
---|---|---|
authorization | string | Authentication header in the form 'Bearer hf_****', where hf_**** is a personal user access token with Inference API permission. You can generate one from your settings page. |
x-use-cache | boolean, default to true | There is a cache layer on the inference API to speed up requests we have already seen. Most models can use those results as they are deterministic (meaning the outputs will be the same anyway). However, if you use a nondeterministic model, you can set this parameter to prevent the caching mechanism from being used, resulting in a real new query. Read more about caching here. |
x-wait-for-model | boolean, default to false | If the model is not ready, wait for it instead of receiving 503. It limits the number of requests required to get your inference done. It is advised to only set this flag to true after receiving a 503 error, as it will limit hanging in your application to known places. Read more about model availability here. |
For more information about Inference API headers, check out the parameters guide.
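The header advice above (only set x-wait-for-model to true after receiving a 503) could be sketched as follows. The function names build_headers and query_with_retry are hypothetical, and encoding the boolean headers as the strings "true" and "false" is an assumption. The post argument is any callable with the same signature as requests.post, which keeps the sketch free of network calls until you supply one.

```python
def build_headers(token, use_cache=True, wait_for_model=False):
    """Build request headers (hypothetical helper).

    Assumption: boolean header values are sent as "true" / "false" strings.
    """
    return {
        "Authorization": f"Bearer {token}",
        "x-use-cache": "true" if use_cache else "false",
        "x-wait-for-model": "true" if wait_for_model else "false",
    }

def query_with_retry(api_url, payload, token, post, use_cache=True):
    """Send the request once; if the model is not ready (503), retry a single
    time with x-wait-for-model enabled, then return the decoded JSON body."""
    response = post(api_url, headers=build_headers(token, use_cache), json=payload)
    if response.status_code == 503:
        response = post(
            api_url,
            headers=build_headers(token, use_cache, wait_for_model=True),
            json=payload,
        )
    # Raise on any remaining HTTP error instead of returning an error body.
    response.raise_for_status()
    return response.json()
```

With the requests library installed, a call might look like query_with_retry(API_URL, {"inputs": "..."}, "hf_***", post=requests.post, use_cache=False), where use_cache=False asks for a fresh generation when sampling is enabled.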
Response
The output type depends on the stream input parameter.
If stream is false (default), the response will be a JSON object with the following fields:
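Body | | |
---|---|---|
details | object | |
details.best_of_sequences | object[] | |
details.best_of_sequences[].finish_reason | enum | Possible values: length, eos_token, stop_sequence. |
details.best_of_sequences[].generated_text | string | |
details.best_of_sequences[].generated_tokens | integer | |
details.best_of_sequences[].prefill | object[] | |
details.best_of_sequences[].prefill[].id | integer | |
details.best_of_sequences[].prefill[].logprob | number | |
details.best_of_sequences[].prefill[].text | string | |
details.best_of_sequences[].seed | integer | |
details.best_of_sequences[].tokens | object[] | |
details.best_of_sequences[].tokens[].id | integer | |
details.best_of_sequences[].tokens[].logprob | number | |
details.best_of_sequences[].tokens[].special | boolean | |
details.best_of_sequences[].tokens[].text | string | |
details.best_of_sequences[].top_tokens | array[] | |
details.best_of_sequences[].top_tokens[][].id | integer | |
details.best_of_sequences[].top_tokens[][].logprob | number | |
details.best_of_sequences[].top_tokens[][].special | boolean | |
details.best_of_sequences[].top_tokens[][].text | string | |
details.finish_reason | enum | Possible values: length, eos_token, stop_sequence. |
details.generated_tokens | integer | |
details.prefill | object[] | |
details.prefill[].id | integer | |
details.prefill[].logprob | number | |
details.prefill[].text | string | |
details.seed | integer | |
details.tokens | object[] | |
details.tokens[].id | integer | |
details.tokens[].logprob | number | |
details.tokens[].special | boolean | |
details.tokens[].text | string | |
details.top_tokens | array[] | |
details.top_tokens[][].id | integer | |
details.top_tokens[][].logprob | number | |
details.top_tokens[][].special | boolean | |
details.top_tokens[][].text | string | |
generated_text | string | |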
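To make the non-streaming body more concrete, here is a minimal sketch that reads a few of the fields listed above. The function name summarize is hypothetical, and sample_body is hand-written to match the documented field names; it is not real API output. The sketch assumes details is only populated when the request asked for generation details, so every lookup under details is treated as optional.

```python
def summarize(body):
    """Extract the text plus a few generation statistics from a response body
    (hypothetical helper)."""
    details = body.get("details") or {}
    tokens = details.get("tokens", [])
    # Average log-probability over non-special generated tokens, if present.
    logprobs = [t["logprob"] for t in tokens if not t.get("special")]
    return {
        "text": body["generated_text"],
        "finish_reason": details.get("finish_reason"),
        "generated_tokens": details.get("generated_tokens"),
        "mean_logprob": sum(logprobs) / len(logprobs) if logprobs else None,
    }

# Illustrative body shaped like the table above (values are made up).
sample_body = {
    "generated_text": "Paris.",
    "details": {
        "finish_reason": "eos_token",
        "generated_tokens": 3,
        "seed": None,
        "prefill": [],
        "tokens": [
            {"id": 1, "text": "Paris", "logprob": -0.5, "special": False},
            {"id": 2, "text": ".", "logprob": -1.5, "special": False},
            {"id": 3, "text": "</s>", "logprob": -0.1, "special": True},
        ],
    },
}

summary = summarize(sample_body)
print(summary)
```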
If stream is true, generated tokens are returned as a stream, using Server-Sent Events (SSE).
For more information about streaming, check out this guide.
Body | | |
---|---|---|
details | object | |
details.finish_reason | enum | Possible values: length, eos_token, stop_sequence. |
details.generated_tokens | integer | |
details.input_length | integer | |
details.seed | integer | |
generated_text | string | |
index | integer | |
token | object | |
token.id | integer | |
token.logprob | number | |
token.special | boolean | |
token.text | string | |
top_tokens | object[] | |
top_tokens[].id | integer | |
top_tokens[].logprob | number | |
top_tokens[].special | boolean | |
top_tokens[].text | string | |
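A streamed response can be consumed line by line. The sketch below is one possible way to do it, assuming each event arrives as a single SSE line of the form data:{...} carrying the JSON body described above, and that generated_text and details are only filled in on the final event. The function names iter_sse_events and collect_stream are hypothetical, and sample_lines is hand-written for illustration rather than captured from the API.

```python
import json

def iter_sse_events(lines):
    """Yield one decoded JSON event per SSE `data:` line."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        yield json.loads(line[len("data:"):].strip())

def collect_stream(events):
    """Concatenate non-special token texts and keep the last event that
    carries a generated_text (assumed to be the final one)."""
    pieces = []
    final = None
    for event in events:
        token = event.get("token") or {}
        if not token.get("special"):
            pieces.append(token.get("text", ""))
        if event.get("generated_text") is not None:
            final = event
    return "".join(pieces), final

# Illustrative stream shaped like the table above (values are made up).
sample_lines = [
    b'data:{"index":1,"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false},"generated_text":null,"details":null}',
    b"",
    b'data:{"index":2,"token":{"id":2,"text":" world","logprob":-0.2,"special":false},"generated_text":"Hello world","details":{"finish_reason":"length","generated_tokens":2,"input_length":3,"seed":null}}',
]

text, final_event = collect_stream(iter_sse_events(sample_lines))
print(text)
```

With the requests library, the lines could come from requests.post(API_URL, headers=headers, json=payload, stream=True).iter_lines(), where payload sets "stream": True.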