gemma-2-27b-it-FP8-fix-system-role
Quantized version of gemma-2-27b-it with an updated chat_template that supports the system role. The updated template avoids these errors:
- Conversation roles must alternate user/assistant/user/assistant/...
- System role not supported
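The updated template itself is not shown here. As a rough illustration only, one common way to accept a system message in Gemma-style conversations is to fold it into the first user turn before the prompt is rendered. The helper name `merge_system_into_first_user` and the blank-line separator are assumptions, not necessarily what this repository's template does.

```python
def merge_system_into_first_user(messages):
    """Return a copy of `messages` with a leading system message folded
    into the first user message (hypothetical helper, for illustration)."""
    if not messages or messages[0]["role"] != "system":
        # Nothing to merge: return an independent copy.
        return [dict(m) for m in messages]
    system_text = messages[0]["content"]
    rest = [dict(m) for m in messages[1:]]
    if rest and rest[0]["role"] == "user":
        # Prepend the system text to the first user turn.
        rest[0]["content"] = f"{system_text}\n\n{rest[0]['content']}"
    else:
        # No user turn to attach to: present the system text as a user turn.
        rest.insert(0, {"role": "user", "content": system_text})
    return rest
```

After this step the conversation contains only user and assistant turns, so a template that enforces strict alternation would accept it.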
Model Overview
- Model Architecture: Gemma 2
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 04/12/2024
- Version: 1.0
Model Optimizations
This model was obtained by quantizing the weights and activations of gemma-2-27b-it to FP8 data type, ready for inference with vLLM >= 0.5.1. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
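As a back-of-the-envelope check of the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. The parameter count of 27 billion is an assumption taken from the model name, and the estimate ignores layers left unquantized, quantization scales, and the KV cache.

```python
def weight_bytes(num_params, bits_per_param):
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_param // 8

NUM_PARAMS = 27_000_000_000  # assumed from the "27b" in the model name

bf16_gb = weight_bytes(NUM_PARAMS, 16) / 1e9
fp8_gb = weight_bytes(NUM_PARAMS, 8) / 1e9
saving = 1 - fp8_gb / bf16_gb
print(f"16-bit: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, saving: {saving:.0%}")
```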
Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations. AutoFP8 is used for quantization with a single instance of every token in random order.
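To make symmetric per-tensor quantization concrete, here is a minimal pure-Python simulation. It is not AutoFP8's implementation. It assumes the E4M3 variant of FP8 (largest finite value 448), and the function names are illustrative.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed target format)

def round_to_e4m3(value):
    """Round a float to the nearest E4M3-representable value (simulation)."""
    value = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    _, exponent = math.frexp(value)
    # 3 mantissa bits give 4 significant bits; values below the smallest
    # normal (2**-6) share the subnormal spacing of 2**-9.
    quantum = 2.0 ** (max(exponent, -5) - 4)
    return round(value / quantum) * quantum

def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
    quantized = [round_to_e4m3(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map FP8 values back to the original range with the same scale."""
    return [q * scale for q in quantized]

weights = [0.1, -0.25, 0.7, 1.5]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
```

Because the scale is symmetric around zero, no zero-point is needed, and the largest-magnitude element of the tensor maps to the edge of the FP8 range.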
Deployment
Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
With CLI:
vllm serve dangvansam/gemma-2-27b-it-FP8-fix-system-role -q fp8
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "dangvansam/gemma-2-27b-it-FP8-fix-system-role",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"}
]
}'
With Python:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "dangvansam/gemma-2-27b-it-FP8-fix-system-role"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Render the conversation with the model's chat template (system role supported)
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the FP8 checkpoint and generate a completion for the rendered prompt
llm = LLM(model=model_id)
outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
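A minimal Python client for the chat completions endpoint used in the curl example could look like the sketch below. It uses only the standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the URL assumes a server running locally on port 8000, as in the curl command.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes a local server on port 8000
MODEL_ID = "dangvansam/gemma-2-27b-it-FP8-fix-system-role"

def build_chat_request(messages, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat([{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who are you?"}])` returns the reply as a string.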