max_length not working?
The reply always seems to be under 70 characters, even when I set a higher max_length. Any ideas?
Example reply:
"Life is a journey, a path we must take.
To find our way, we"
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from dotenv import load_dotenv
load_dotenv()
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b-instruct", model_kwargs={"temperature":0.1, "max_length":2000,})
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Write a poem about life"
print(question)
print('➡️ ', llm_chain.run(question))
I have the same issue. At least in combination with LangChain, the model tends to output only a few tokens and then stops in the middle of a sentence.
It would be nice to know whether we are doing something wrong or whether this is just the way the model works.
I found our mistake.
@domid10 you need to add max_new_tokens and set it to a higher value to get longer replies.
Example:
from langchain.llms import HuggingFaceEndpoint

# HUGGINFACE_KEY holds your Hugging Face API token
llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct",
    huggingfacehub_api_token=HUGGINFACE_KEY,
    task="text-generation",
    model_kwargs={
        "temperature": 0.2,
        # max_new_tokens caps only the generated tokens, not the prompt
        "max_new_tokens": 400,
        "num_return_sequences": 1,
    },
)
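For anyone wondering why max_length seemed to do nothing: in transformers, max_length counts the prompt plus the generated tokens, while max_new_tokens counts only the generated ones. As far as I can tell, the hosted text-generation endpoint reads max_new_tokens and falls back to a small default when it is missing, which would explain the cut-off replies. Here is a rough sketch of the request body that ends up being sent. build_payload is a hypothetical helper I made up for illustration, not part of LangChain or the Hugging Face client, and the default values are just the ones from the example above.

```python
import json


def build_payload(prompt, max_new_tokens=400, temperature=0.2):
    """Hypothetical helper: build the JSON body for a text-generation request.

    Assumes the usual {"inputs": ..., "parameters": {...}} request shape.
    """
    return {
        "inputs": prompt,
        "parameters": {
            # Limits only the newly generated tokens; the prompt is not counted.
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


payload = build_payload("Write a poem about life")
print(json.dumps(payload, indent=2))
```

Note that max_length does not appear in the parameters at all; if the reply is still cut short, raising max_new_tokens is the first thing to try.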
Hello all, I am interested in Falcon inference benchmarks on A100 and T4 GPUs. I would be thankful if someone could share their inference statistics:
a) GPU type
b) Average inference time per request