"Model is overloaded, please wait for a bit"
Any way to stop this image from popping up?
I was facing this issue a few hours ago, but now it is working on Hugging Face's Accelerated Inference! It normally takes more than 90 s to generate 64 tokens with "use_gpu": True, but it runs.
I have the same problem here. I can only get generated_text from bloom-3b and never succeeded with bloom. Any solutions?
Are you still having the issue? We've recently been moving to AzureML, so service might have been disrupted at some point. But it should be a lot more stable now.
Just to be clear we're talking about the Inference API?
Now it's working (yes, it is the Inference API). But now I have another problem. It seems that even if I set num_return_sequences to more than 1, I only get 1 generated_text from bloom. I get the right number of generated_text with bloom-3b. Is it because bloom is too big so that it can only do greedy decoding?
We currently have a custom deployment setup for BLOOM (in order to improve inference speed and such), which doesn't support all the options yet. We'll try to support new options as the requests come in, I guess.
Is it because bloom is too big so that it can only do greedy decoding?
Actually it does more than greedy decoding: you can add the top_k and top_p options.
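For anyone who wants to try this, here is a minimal sketch of a request body that carries the options mentioned in this thread (top_k, top_p, num_return_sequences). The helper names build_payload and query are made up for illustration, the max_new_tokens parameter and the bearer-token header are assumptions based on the general Inference API conventions, and as noted above the BLOOM deployment may ignore options it does not support yet.

```python
import json
import urllib.request

# Endpoint discussed in this thread.
API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"


def build_payload(prompt, max_new_tokens=64, top_k=None, top_p=None,
                  num_return_sequences=None):
    """Build a text-generation request body, leaving out options that are unset."""
    parameters = {"max_new_tokens": max_new_tokens}
    if top_k is not None:
        parameters["top_k"] = top_k
    if top_p is not None:
        parameters["top_p"] = top_p
    if num_return_sequences is not None:
        parameters["num_return_sequences"] = num_return_sequences
    return {"inputs": prompt, "parameters": parameters}


def query(payload, token):
    """POST the payload as JSON and return the decoded JSON response.

    `token` is your own API token (assumption: sent as a bearer token).
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling query(build_payload("Once upon a time", top_k=50, top_p=0.9), token) with your own token would then send a sampling request; how many generated_text entries come back depends on what the deployment supports.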
Hi @TimeRobber, the max tokens for the API seems to be 1024, although I believe the bloom model can take longer sequences. With anything greater than 1024, I get the message "Model is overloaded, please wait for a bit". Is this max length fixed for all users, or can paid plans increase it?
I think we hard-limit incoming requests that are beyond a specific length so that people don't spam our service. In theory, if you host the model yourself, you can go to arbitrarily long sequences, as it uses a relative positional embedding scheme (ALiBi) that can extrapolate to any length regardless of the sequence length used in training. More details can be found here: https://arxiv.org/abs/2108.12409
Max length should be fixed for all users. At least from this API endpoint. cc @olivierdehaene
Concerning whether there's a paid plan you'd have to ask @Narsil to confirm, but I think there should be none.
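To illustrate why the positional scheme in the linked paper (ALiBi, attention with linear biases) is not tied to a training length, here is a small pure-Python sketch of the bias it adds to attention scores. It follows the paper's formula for a power-of-two number of heads; the function names are made up for this example, and this is not taken from the BLOOM code itself.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8*i/num_heads) for i = 1..num_heads (power-of-two head counts only)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j] = slope_h * (j - i) for j <= i.

    Positions with j > i are future tokens under causal attention and are
    marked None here (they would be masked out anyway).
    """
    return [
        [
            [slope * (j - i) if j <= i else None for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for slope in alibi_slopes(num_heads)
    ]
```

The bias depends only on the distance between query and key positions, so nothing in the computation refers to a maximum length; seq_len can be any value, which is the property the reply above relies on.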
Thanks. Re: hosting this ourselves, can I confirm that when I call API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom" I am hitting the large bigscience/bloom version and not one of the smaller versions, like bigscience/bloom-7b1? Also, can I confirm whether this is running on CPU?
It seems very fast for a CPU model on large bloom: I get 10 seconds. If it were feasible to get this speed with an ONNX-accelerated large bloom, I could try hosting it myself.
P.S. I saw in the docs that the x-compute-type header of the response indicates whether it is CPU or GPU, but I could not see that value.
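For reference, a small sketch of how that header could be read from a response. The helper name compute_type is made up for illustration; it takes any mapping of response headers and returns None when the header is missing, which appears to be the case described here.

```python
def compute_type(headers):
    """Return the x-compute-type header value, or None if the response lacks it.

    Header names are compared case-insensitively, since servers and client
    libraries differ in how they capitalize them.
    """
    for name, value in headers.items():
        if name.lower() == "x-compute-type":
            return value
    return None
```

With the standard library this would be called as compute_type(response.headers) on the object returned by urllib.request.urlopen; a None result only means the header was not sent, not that the model runs on CPU.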
when I call API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom" I am hitting the large bigscience/bloom version and not one of the smaller versions, like bigscience/bloom-7b1
Yes you're running the big model
can I confirm if this is running on CPU
No it runs on GPUs in a parallel fashion. You can find more details at https://huggingface.co/blog/bloom-inference-optimization
BLOOM is a special deployment, and it's currently powered by AzureML. There won't be CPU inference for BLOOM.
Thanks a lot @TimeRobber this is very helpful
If you are interested in the code behind our BLOOM deployment, you can find the new version currently running here: https://github.com/huggingface/text-generation-inference.
The original code described in the blog post can also be found here: https://github.com/huggingface/transformers_bloom_parallel/.
I have had the same problem all night. I just wanted to try it out and see if I can get any kind of response. Maybe I have to wait, or maybe calling the API myself works better? Does anyone have a solution?
Hi! Bloom hosting is currently undergoing maintenance by the AzureML team and will be back up as soon as this has been completed. We'll try to get it back up ASAP.
Model is back up.
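For short outages like the ones in this thread, a client can simply wait and retry. Below is a minimal sketch of retrying with exponential backoff. It assumes the error comes back as a JSON object with an "error" field containing the "overloaded" message (an assumption, not confirmed here), and the names is_overloaded and call_with_retry are made up for illustration. Note that, per the discussion above, requests beyond the length limit return the same message, and retrying will not help in that case.

```python
import time


def is_overloaded(result):
    """Assumption: an overloaded reply is a dict whose "error" text mentions 'overloaded'."""
    return isinstance(result, dict) and "overloaded" in str(result.get("error", "")).lower()


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-overloaded result.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts and raises RuntimeError once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        result = send()
        if not is_overloaded(result):
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"model still overloaded after {max_attempts} attempts")
```

Here send is any zero-argument callable that performs one request and returns the decoded JSON, so the same wrapper works with whichever HTTP client you use.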