---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: image-text-to-text
license: apache-2.0
library_name: vllm
base_model:
- mistral-community/pixtral-12b
- mistralai/Pixtral-12B-2409
base_model_relation: quantized
datasets:
- HuggingFaceH4/ultrachat_200k
---

# Pixtral-12B-2409: FP8 Dynamic Quant + FP8 KV Cache

Quant of [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b) using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for optimised inference on vLLM.

FP8 dynamic quantization is applied to the language model, and the KV cache is quantized to FP8. The `multi_modal_projector` and `vision_tower` are left in FP16, since they account for only a small fraction of the model. KV cache scales were calibrated on 2048 ultrachat samples.

Example vLLM usage:

```
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8
```

Supported on Nvidia GPUs with compute capability ≥ 8.9 (Ada Lovelace, Hopper).

**Edit:** Something seems to be wrong with the tokenizer. If you run into issues, add `--tokenizer mistral-community/pixtral-12b` to your vLLM command-line args.
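For reference, a minimal reproduction sketch is below. The exact script used to produce this model wasn't published; this follows LLM Compressor's documented FP8-dynamic and FP8 KV-cache examples, so the recipe details (scale strategies, sequence length, output path) are assumptions.

```
# Hypothetical reproduction sketch based on llm-compressor's published examples.
# Recipe details and paths here are assumptions, not the exact script used.
from datasets import load_dataset
from transformers import AutoProcessor, LlavaForConditionalGeneration

from llmcompressor.transformers import oneshot

MODEL_ID = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 dynamic quant for the language model's Linear layers; vision_tower and
# multi_modal_projector are excluded so they stay in FP16. The KV cache scheme
# is static, so its scales come from calibration.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_tower.*"]
            config_groups:
                group_0:
                    targets: ["Linear"]
                    weights: {num_bits: 8, type: float, strategy: channel, dynamic: false, symmetric: true}
                    input_activations: {num_bits: 8, type: float, strategy: token, dynamic: true, symmetric: true}
            kv_cache_scheme: {num_bits: 8, type: float, strategy: tensor, dynamic: false, symmetric: true}
"""

# 2048 calibration samples from ultrachat_200k, as noted above.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(2048))

def preprocess(example):
    text = processor.tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return processor.tokenizer(text, max_length=2048, truncation=True, add_special_tokens=False)

ds = ds.map(preprocess, remove_columns=ds.column_names)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=2048,
    output_dir="pixtral-12b-FP8-dynamic-FP8-KV-cache",
)
```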
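Once the server is up, you can smoke-test it through vLLM's OpenAI-compatible endpoint. A minimal sketch, assuming the serve command above is running on the default local port; the image URL is a placeholder:

```
# Minimal client sketch against vLLM's OpenAI-compatible API.
# Assumes the server above is on localhost:8000; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```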