|
--- |
|
tags: |
|
- fp8 |
|
- vllm |
|
language: |
|
- en |
|
- de |
|
- fr |
|
- it |
|
- pt |
|
- hi |
|
- es |
|
- th |
|
pipeline_tag: image-text-to-text |
|
license: apache-2.0 |
|
library_name: vllm |
|
base_model: |
|
- mistral-community/pixtral-12b |
|
- mistralai/Pixtral-12B-2409 |
|
base_model_relation: quantized |
|
datasets: |
|
- HuggingFaceH4/ultrachat_200k |
|
--- |
|
|
|
# Pixtral-12B-2409: FP8 Dynamic Quant + FP8 KV Cache |
|
|
|
FP8 quant of [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b) made with [LLM Compressor](https://github.com/vllm-project/llm-compressor) for optimised inference on vLLM.
|
|
|
FP8 dynamic quant applied to the language model, plus FP8 quant of the KV cache. The `multi_modal_projector` and `vision_tower` are left in FP16 since they make up only a small part of the model.
|
|
|
Calibrated on 2048 samples from [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
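
For reference, below is a rough sketch of how a quant like this can be produced with LLM Compressor. It follows the project's published FP8 + FP8 KV cache examples rather than the exact script used for this model, so the `ignore` patterns, sequence length, and import paths are assumptions that may need adjusting for your LLM Compressor version.

```python
# Hypothetical reproduction sketch -- based on the public LLM Compressor
# examples, not the exact script used for this repo.
from datasets import load_dataset
from transformers import AutoProcessor, LlavaForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistral-community/pixtral-12b"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048  # assumption; the card does not state this

model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
tokenizer = processor.tokenizer

# Text-only calibration set, as described above.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    # Render each chat into plain text with the model's chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# FP8 dynamic quant on the language model's Linear layers; the vision tower
# and projector are excluded so they stay in FP16. kv_cache_scheme adds
# static FP8 scales for the KV cache, calibrated on the samples above.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
    kv_cache_scheme={
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="pixtral-12b-FP8-dynamic-FP8-KV-cache",
)
```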
|
|
|
Example vLLM usage:
|
```shell
|
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8 |
|
``` |
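
The server exposes vLLM's OpenAI-compatible API, so image-plus-text requests work with any OpenAI client. A minimal sketch, assuming the server is on the default port 8000 (the image URL is a placeholder):

```python
# Minimal client sketch against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Placeholder URL -- swap in a real image.
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```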
|
|
|
Supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
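
To check whether your GPU qualifies, you can query its compute capability through PyTorch (already a vLLM dependency):

```python
# Prints the GPU's compute capability; FP8 needs >= 8.9.
import torch

major, minor = torch.cuda.get_device_capability()
supported = (major, minor) >= (8, 9)
print(f"Compute capability {major}.{minor}: FP8 {'supported' if supported else 'not supported'}")
```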
|
|
|
**Edit:** Something seems to be wrong with the bundled tokenizer. If you run into issues, add `--tokenizer mistral-community/pixtral-12b` to your vLLM command-line args, as shown below.
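
```shell
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8 --tokenizer mistral-community/pixtral-12b
```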