---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: image-text-to-text
license: apache-2.0
library_name: vllm
base_model:
- mistral-community/pixtral-12b
- mistralai/Pixtral-12B-2409
base_model_relation: quantized
datasets:
- HuggingFaceH4/ultrachat_200k
---
# Pixtral-12B-2409: FP8 Dynamic Quant + FP8 KV Cache
A quantized version of [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b), produced with [LLM Compressor](https://github.com/vllm-project/llm-compressor) for optimised inference on vLLM.

The language model uses FP8 dynamic quantization, and the KV cache is quantized to FP8. `multi_modal_projector` and `vision_tower` are left in FP16 since they're a small part of the model.

Calibrated on 2048 samples from HuggingFaceH4/ultrachat_200k.

Example vLLM usage:
```
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8
```
Supported on Nvidia GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
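
If you're not sure whether your GPU qualifies, a rough check like the one below may help. It's only a sketch: `cap_supports_fp8` is a hypothetical helper name, it treats 8.9 itself as supported (Ada Lovelace cards report 8.9), and it assumes your driver's `nvidia-smi` exposes the `compute_cap` query field, which older drivers may not.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a compute capability such as "8.9" is at least 8.9.
cap_supports_fp8() {
    local major="${1%%.*}"   # part before the dot
    local minor="${1#*.}"    # part after the dot
    (( major > 8 || (major == 8 && minor >= 9) ))
}

# Only query the GPUs if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader |
    while IFS=, read -r name cap; do
        cap="${cap// /}"     # strip the space after the CSV comma
        if cap_supports_fp8 "$cap"; then
            echo "$name ($cap): meets the FP8 requirement"
        else
            echo "$name ($cap): below compute capability 8.9"
        fi
    done
fi
```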

**Edit:** Something seems to be wrong with the tokenizer. If you run into issues, add `--tokenizer mistral-community/pixtral-12b` to your vLLM command-line arguments.
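
For reference, the serve command from above with the tokenizer workaround applied would look something like this. It just combines the flags already mentioned in this card.

```bash
# Same command as the usage example, with the tokenizer taken from the base repo
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --tokenizer mistral-community/pixtral-12b
```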