This repository contains improved quantized Mistral-7B models in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

The table below compares these models with the current llama.cpp quantization approach, using Wikitext perplexity (PPL) at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44% |
| Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75% |
| Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75% |
| Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59% |
| Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31% |
| Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76% |
| Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94% |
| Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26% |
| Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24% |
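
The quantization-error definition can be sketched in a few lines of Python. This is only an illustration: the function names `quantization_error` and `implied_fp16_ppl` are hypothetical, and the fp16 perplexity is not stated in this README, so the sketch back-derives it from one table row (Q4_K_M) by inverting the definition. Because the table percentages are rounded, the recovered baseline is approximate.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase over the fp16 baseline."""
    return (ppl_quantized - ppl_fp16) / ppl_fp16


def implied_fp16_ppl(ppl_quantized: float, error: float) -> float:
    """Invert the definition: PPL(fp16) = PPL(quantized) / (1 + error)."""
    return ppl_quantized / (1.0 + error)


# Q4_K_M row: llama.cpp PPL 5.7539 with a reported error of 1.08%.
# Assumption: the fp16 baseline is recovered from these two rounded numbers.
ppl_fp16 = implied_fp16_ppl(5.7539, 0.0108)

# Apply the definition to the new-quant perplexity from the same row.
new_error = quantization_error(5.7259, ppl_fp16)

print(f"implied fp16 PPL: {ppl_fp16:.4f}")
print(f"new-quant quantization error: {new_error:.2%}")
```

With the back-derived baseline, the error computed for the new Q4_K_M quant agrees with the 0.59% listed in the table to within rounding.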

In addition, a 2-bit model is provided (mistral-7b-q2k-extra-small.gguf). Its perplexity is 6.7099 at a context length of 512 tokens and 5.5744 at a context length of 4096 tokens.
