What is the best way to serve a Hugging Face model with an API?

Use TF or PyTorch?

For PyTorch, TorchServe or Pipelines with something like Flask?

You have a few different options. Here are some, in increasing order of difficulty:

  1. You can use the Hugging Face Inference API via Model Hub if you are just looking for a demo.
  2. You can use a hosted model deployment platform: GCP AI predictions, SageMaker, https://modelzoo.dev/. Full disclosure: I am the developer behind Model Zoo, and I'm happy to give you some credits for experimentation.
  3. You can roll your own model server with something like https://fastapi.tiangolo.com/ and deploy it on a generic serving platform like AWS Elastic Beanstalk or Heroku. This is the most flexible option.
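
As a rough illustration of option 3, here is a minimal sketch of a FastAPI server wrapping a transformers pipeline. The names `create_app`, `format_predictions`, `PredictRequest`, and the `/predict` route are hypothetical, and the default `sentiment-analysis` pipeline is only an assumption about the task. The heavy dependencies are imported inside the functions so the formatting helper can be reused without them.

```python
from typing import Callable, Dict, List


def format_predictions(raw: List[Dict]) -> List[Dict]:
    """Turn pipeline output into JSON-friendly dicts with rounded scores."""
    return [
        {"label": item["label"], "score": round(float(item["score"]), 4)}
        for item in raw
    ]


def create_app(classify: Callable[[str], List[Dict]]):
    """Build a FastAPI app around any callable mapping text to predictions."""
    # Imported lazily so this module loads even where FastAPI is not installed.
    from fastapi import FastAPI
    from pydantic import BaseModel

    class PredictRequest(BaseModel):
        text: str

    app = FastAPI()

    @app.post("/predict")
    def predict(req: PredictRequest):
        # The model is loaded once by the caller and reused for every request.
        return {"predictions": format_predictions(classify(req.text))}

    return app


def load_default_classifier():
    """Assumption: a sentiment model; swap in whichever pipeline you need."""
    from transformers import pipeline

    return pipeline("sentiment-analysis")
```

To try it, you could put `app = create_app(load_default_classifier())` in a module and run that module with an ASGI server such as uvicorn.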

@yoavz Hey, I am also looking for an answer to this. Can you share more references or tutorials? Thank you

Sure, here are more links for each path:

  1. Hugging Face Model Hub: https://huggingface.co/transformers/model_sharing.html
  2. Model Zoo: https://docs.modelzoo.dev/quickstart/transformers.html
  3. Roll your own deployment stack: https://github.com/curiousily/Deploy-BERT-for-Sentiment-Analysis-with-FastAPI

Interested in model serving too. I don’t think that FastAPI stack is what we want: it is good for a quickstart, but it’s preferable to have FastAPI serve your web API and use a job queue (e.g. RabbitMQ) for submitting expensive GPU jobs. I currently have an EC2 instance that I spin up on demand from the FastAPI server; I submit the job, receive the results, and send them to the client. Alternatively, you can use AWS Batch with a Dockerfile for your transformers models. But in the spin-up case, scaling is a real pain, and in the Batch case, there is huge overhead in the Batch job coming online just for inference.
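
The web-API-plus-job-queue split can be sketched with the standard library alone. This is only an in-process stand-in: a `queue.Queue` and a single worker thread take the place of a real broker such as RabbitMQ and a GPU worker, and `InferenceQueue` and `model_fn` are hypothetical names.

```python
import queue
import threading
from concurrent.futures import Future
from typing import Any, Callable


class InferenceQueue:
    """In-process stand-in for a broker plus one GPU worker (an assumption)."""

    def __init__(self, model_fn: Callable[[str], Any]):
        self._jobs: queue.Queue = queue.Queue()
        self._model_fn = model_fn
        # A single daemon worker drains the queue, so expensive jobs run one at a time.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, text: str) -> Future:
        """Called by the web handler: enqueue a job and return immediately."""
        fut: Future = Future()
        self._jobs.put((text, fut))
        return fut

    def _run(self) -> None:
        while True:
            text, fut = self._jobs.get()
            try:
                fut.set_result(self._model_fn(text))
            except Exception as exc:  # report failures to the waiting caller
                fut.set_exception(exc)
            finally:
                self._jobs.task_done()


# Usage: str.upper stands in for a real model call.
jobs = InferenceQueue(lambda text: text.upper())
result = jobs.submit("hello").result(timeout=5)
```

With a real broker the web server and the worker would live in separate processes or machines, which is what lets the GPU side scale independently of the API side.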

What we really want is proper cloud hosting of models, e.g. via GCP AI Platform. I’m not sure if Model Zoo serves this purpose; I’ll check it out ASAP. I do see https://github.com/maxzzze/transformers-ai-platform/tree/master/models/classification, but its last commit is Feb 10, and my quick scan of the repo makes me think it might be a bit rigid and would take a fair bit of tinkering for flexible use cases.

What would really be handy is a tutorial on deploying transformers models to GCP AI: how to prepare and upload; how to separate the surrounding code (model prep, tokenization prep, etc.); how to deal with their 500 MB model quota; all that stuff. Ideally there’d be some fairly first-class Hugging Face exporter, or an on-site tutorial.

Actually, this could be a business proposition for Hugging Face: host your models, and charge for API calls! We’d develop locally to get things sorted, but then switch to the API so we don’t have to worry about instance scaling and the like. Anyway, I’ll check out Model Zoo in case that’s what it does.

I have shared an example using TorchServe (for the NER use case), but it can be extended to other task types by using different pipelines.
blogpost and repo
Includes a demo UI too!
(can’t include more links because I’m a new user on this forum…just refer to the blogpost)
Hope it helps~

Is there a way to serve the Hugging Face BERT model with TF Serving such that TF Serving handles the tokenization along with inference? Is there any related documentation or blog post?

@jplu might help with this

Hi @anubhavmaity !

Thanks for your question. Unfortunately, it is currently not possible to integrate the tokenization process along with inference directly inside a saved model. Nevertheless, making this available is part of our plans, and we are currently rethinking the way saved models are handled in transformers :)

I know this is old, but have you seen this? https://huggingface.co/pricing

It is basically exactly what you’re asking for. We’re hosting your models and running them at scale!

We are considering deploying the models in SageMaker vs deploying on EC2.
What are others’ opinions on this?

SageMaker:

  • We found that only a limited set of models is available in SageMaker, and there are dependencies, such as some models not being available in certain regions

  • Having a model in an S3 bucket may not comply with some regulations that require data to be kept locally

  • We found it expensive. Currently, we want to run it for a while for testing and shut down the instance when it is not in use, while leaving the dev environment intact. SageMaker posed some limitations here: doable, but more work.

Serving the model through a Django/REST API server:
We are currently exploring downloading a model onto EC2 and then running the inference client in an async loop. The flow is: client -> REST API -> routed to Hugging Face inference objects like Pipeline…
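
One way to keep an async loop responsive while a blocking model call runs is to push the call onto a thread pool. This is a sketch under assumptions, not the poster's actual setup: `infer_many` and `model_fn` are hypothetical names, and `model_fn` stands in for a loaded pipeline object.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


async def infer_many(
    model_fn: Callable[[str], Any], texts: List[str], max_workers: int = 2
) -> List[Any]:
    """Run a blocking model call for each text without stalling the event loop."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call runs in a worker thread; the loop stays free for other requests.
        tasks = [loop.run_in_executor(pool, model_fn, text) for text in texts]
        # gather preserves the order of the inputs in its results.
        return await asyncio.gather(*tasks)


# Usage: len stands in for a real pipeline call.
lengths = asyncio.run(infer_many(len, ["a", "bb", "ccc"]))
```

In a long-running server you would probably create the executor once at startup instead of per call; it is created inline here only to keep the sketch short.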

AWS Inferentia servers:
We are still checking with AWS whether that’s a better option. The end goal would be better latency and cost optimization vs EC2. However, this is not a concern for us right now, since in development/testing we will have minimal traffic.

Would be good to hear others’ thoughts and experience.