|
--- |
|
tags: |
|
- fp8 |
|
- vllm |
|
language: |
|
- en |
|
- de |
|
- fr |
|
- it |
|
- pt |
|
- hi |
|
- es |
|
- th |
|
pipeline_tag: image-text-to-text |
|
license: apache-2.0 |
|
library_name: vllm |
|
base_model: |
|
- mistral-community/pixtral-12b |
|
- mistralai/Pixtral-12B-2409 |
|
base_model_relation: quantized |
|
datasets: |
|
- HuggingFaceH4/ultrachat_200k |
|
--- |
|
|
|
# Pixtral-12B-2409: FP8 Dynamic Quant + FP8 KV Cache |
|
|
|
Quant of [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b) using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for optimised inference on VLLM. |
|
|
|
FP8 dynamic quant on language model, and FP8 quant of KV cache. multi_modal_projector and vision_tower left in FP16 since it's a small part of the model. |
|
|
|
Calibrated on 2048 ultrachat samples. |
|
|
|
Example VLLM usage |
|
``` |
|
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8 |
|
``` |
|
|
|
Supported on Nvidia GPUs with compute capability > 8.9 (Ada Lovelace, Hopper). |