It now works on Text Generation Inference Engine via LiteLLM Proxy!

#1
by ssakel - opened

This instruct model still needs work in order to behave similarly to mistral-7B-instruct.
I ran it with Hugging Face TGI on a local machine with 4 x 3090 GPUs, each with 24 GB of VRAM.
Unfortunately, it mixes Greek and English in its answers. Most of the time the answers were irrelevant to the question.

Institute for Language and Speech Processing org
edited Mar 28, 2024

Hi Spyros @ssakel, thanks for the feedback.
Would you be willing to share (some of) the chats and the deployment hyperparameters (e.g., temperature) you used with us?

Yes of course!

My setup is the following:

  1. TGI in docker: docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Mistral-7B-Instruct-v0.2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
  2. litellm OpenAI Proxy with the following YAML:
     - model_name: Meltemi-TGI
       litellm_params:
         model: huggingface/Meltemi-7B-Instruct-v1
         api_base: http://0.0.0.0:8080
  3. Gradio Application written in Python.

Following are a couple of screenshots:

Screenshot from 2024-03-28 18-10-38.png
Screenshot from 2024-03-28 18-08-33.png

  1. TGI in docker: docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Mistral-7B-Instruct-v0.2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096

Why are you using --model-id /data/Mistral-7B-Instruct-v0.2?

Institute for Language and Speech Processing org

Yes, the --model-id /data/Mistral-7B-Instruct-v0.2 looks curious. Maybe Mistral is being used instead of Meltemi?

Also, I'm not sure about the templating mechanism used by TGI.
Meltemi-instruct works well with something like the following: <|system|>\n{{ SYSTEM_PROMPT }}\n</s><|user|>\n{{ USER_PROMPT }}\n</s><|assistant|>, where </s> is the eos token.

In Ollama I tested with the following template code. Can you try something similar in TGI?

{{- if .System }}
<|system|>
{{ .System }}
</s>
{{- end }}
<|user|>
{{ .Prompt }}
</s>
<|assistant|>
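
For anyone who prefers to build the prompt in Python before sending it to a server, here is a minimal sketch of the single-turn format described above. The function name `build_meltemi_prompt` is made up for illustration, and the layout follows the one-line form `<|system|>\n...\n</s><|user|>\n...\n</s><|assistant|>` quoted in this thread, so treat it as an approximation rather than an official template.

```python
EOS = "</s>"  # eos token, as stated in this thread


def build_meltemi_prompt(user_prompt: str, system_prompt: str = "") -> str:
    """Render a single-turn prompt in the format described in this thread.

    Hypothetical helper, not part of any library.
    """
    parts = []
    if system_prompt:
        # The system block is optional, mirroring the `if .System` guard
        # in the Ollama template.
        parts.append(f"<|system|>\n{system_prompt}\n{EOS}")
    parts.append(f"<|user|>\n{user_prompt}\n{EOS}")
    # The prompt ends with the assistant tag so the model continues from there.
    parts.append("<|assistant|>")
    return "".join(parts)


if __name__ == "__main__":
    print(build_meltemi_prompt("Ποια είναι η πρωτεύουσα της Ελλάδας;",
                               "You are a helpful assistant."))
```

The resulting string can be passed as the raw input to whichever backend is in use.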

Sorry, wrong copy-paste. Here is the TGI command:

docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Meltemi-7B-Instruct-v1 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
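
To rule out the proxy when debugging, it can help to call TGI directly with a hand-built prompt. The sketch below posts to TGI's `/generate` endpoint using only the standard library. It assumes the container is reachable on `localhost:8080` (the port mapped in the command above); the helper names, the `max_new_tokens` default, and the temperature of 0.3 are illustrative choices, not required values.

```python
import json
import urllib.request

# Assumption: the TGI container from the docker command is reachable here.
TGI_URL = "http://localhost:8080/generate"


def make_generate_request(prompt: str, max_new_tokens: int = 256,
                          temperature: float = 0.3) -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint (hypothetical helper)."""
    payload = {
        "inputs": prompt,  # raw prompt string, already templated
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return urllib.request.Request(
        TGI_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt to TGI and return the generated text."""
    with urllib.request.urlopen(make_generate_request(prompt)) as resp:
        return json.loads(resp.read())["generated_text"]


if __name__ == "__main__":
    # Requires a running TGI server.
    print(generate("<|user|>\nΓεια σου!\n</s><|assistant|>"))
```

If the answers look fine here but not through the proxy, the prompt template applied by the proxy is the likely culprit.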

I will try it with ollama and the appropriate template.

Institute for Language and Speech Processing org
edited Mar 28, 2024

Just pushed a quantized version to ollama

Try

ollama run ilsp/meltemi-instruct

If you are using it through the console, I have observed some issues where words get cut off (probably an ollama issue). If you use it through Open WebUI, this should be fixed.

I just tried it with ollama and it seems to work very well (so far I have only tried it from the command line).
Thanks George Paraskevopoulos @geopar !!!!!!!!!!!

When I run it through the litellm OpenAI proxy, I get an exception, so it could be a template issue, as mentioned by @geopar. I will try to put a custom template in the litellm proxy, following @geopar's instructions, and see if it works. Hopefully this will let me access Meltemi via both ollama and TGI without worrying about which model/template I use.

ssakel changed discussion title from Still needs work to It now works on Text Generation Inference Engine via LiteLLM Proxy!

With @geopar's template in the LiteLLM configuration, it worked perfectly with TGI. Thank you guys for this model that speaks Greek!!!! And thanks to @geopar for his help!!!

BTW: I changed the discussion title because it was misleading; this was a template config issue.

FYI:
This is the Litellm YAML config for running Meltemi on TGI:

  - model_name: Meltemi-TGI
    litellm_params:
      model: huggingface/Meltemi-7B-Instruct-v1
      api_base: http://0.0.0.0:8080
      roles: {"system":{"pre_message":"<|system|>system\n", "post_message":""}, "user":{"pre_message":"<|user|>user\n","post_message":""}, "assistant":{"pre_message":"<|assistant|>assistant\n","post_message":" "}}
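
To see what prompt string a `roles` mapping like this produces, here is a rough sketch of the wrapping logic. It assumes the proxy simply concatenates `pre_message + content + post_message` for each chat message in order; the real LiteLLM implementation may add other pieces (for example initial or final prompt values), so this is only an approximation for inspecting the config. The function name `apply_roles` is hypothetical.

```python
# Role wrappers copied from the YAML config in this thread.
ROLES = {
    "system": {"pre_message": "<|system|>system\n", "post_message": ""},
    "user": {"pre_message": "<|user|>user\n", "post_message": ""},
    "assistant": {"pre_message": "<|assistant|>assistant\n", "post_message": " "},
}


def apply_roles(messages, roles=ROLES):
    """Concatenate chat messages into one prompt string.

    Assumption: each message becomes pre_message + content + post_message.
    """
    prompt = ""
    for message in messages:
        wrap = roles[message["role"]]
        prompt += wrap["pre_message"] + message["content"] + wrap["post_message"]
    return prompt


if __name__ == "__main__":
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Γεια σου!"},
    ]
    # repr() makes the newlines in the rendered prompt visible.
    print(repr(apply_roles(chat)))
```

Printing the rendered prompt this way makes it easy to compare it against the template recommended earlier in the thread.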

I tested it on 4 x Nvidia 3090 cards (96 GB VRAM in total) and on an older single Quadro M6000 24 GB card, and it worked exactly the same, except for the speed of course. Here are the TGI logs and statistics for each configuration when asked the same question:

INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-quadro-m6000-24gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.3), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(2020), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="95.490662472s" validation_time="412.864µs" queue_time="43.181µs" inference_time="95.490206667s" time_per_token="134.115458ms" seed="Some(1510580074201680330)"}: text_generation_router::server: router/src/server.rs:489: Success

INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-geforce-rtx-3090"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.3), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(2020), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="9.178317375s" validation_time="774.753µs" queue_time="48.772µs" inference_time="9.177494051s" time_per_token="14.613844ms" seed="Some(12459441799522371169)"}: text_generation_router::server: router/src/server.rs:489: Success
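
To put the two log lines side by side, the short sketch below derives throughput and an approximate token count from the `inference_time` and `time_per_token` values copied from the logs. It assumes `time_per_token` is inference time divided by the number of generated tokens, which is an interpretation of the log fields rather than something stated in this thread.

```python
# Values copied from the two TGI log lines above.
RUNS = {
    "1-quadro-m6000-24gb": {"inference_s": 95.490206667, "ms_per_token": 134.115458},
    "4-nvidia-geforce-rtx-3090": {"inference_s": 9.177494051, "ms_per_token": 14.613844},
}


def summarize(run):
    """Return (approx. generated tokens, tokens per second) for one run.

    Assumption: time_per_token = inference_time / generated_tokens.
    """
    tokens = round(run["inference_s"] * 1000 / run["ms_per_token"])
    tokens_per_s = 1000 / run["ms_per_token"]
    return tokens, tokens_per_s


def per_token_speedup(slow, fast):
    """Ratio of per-token latency between two runs."""
    return slow["ms_per_token"] / fast["ms_per_token"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        tokens, tps = summarize(run)
        print(f"{name}: ~{tokens} tokens, {tps:.1f} tokens/s")
    ratio = per_token_speedup(RUNS["1-quadro-m6000-24gb"],
                              RUNS["4-nvidia-geforce-rtx-3090"])
    print(f"per-token speedup: {ratio:.1f}x")
```

Under that assumption, the two answers differ in length, so the per-token figures are a fairer comparison than the total times.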

Screenshot from 2024-03-30 10-50-56.png

Institute for Language and Speech Processing org

Thank you @ssakel for sharing this integration and your configuration!
If you haven't seen them yet, we have also uploaded quantized versions that can help with deployment:

https://huggingface.co/ilsp/Meltemi-7B-Instruct-v1-AWQ
https://huggingface.co/ilsp/Meltemi-7B-Instruct-v1-GGUF

Closing the issue for now as resolved

geopar changed discussion status to closed
