Quantization suggestion
Always quantize the output and the embed tensors to f16 and the others to q4-q8.
Quantizing output and embed tensors more will degrade the model a lot.
I tried quantizing output/embed to q8 but the results were way worse than f16.
My best recipe is f16 for output/embed and q5_k or q6_k for the others.
Examples:
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q5.gguf q5_k
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q6.gguf q6_k
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q8.gguf q8_0
interesting, i've never heard anyone mention this before, i may look into this to see how it works..
You will see the difference. Think of the embed tensor as the model's "understanding" and the output tensor as its "speaking".
I want those to be as untouched as possible, while the reasoning in the middle can be less detailed...
Let's say you need to describe whether something is good or bad: surely 2 bits would be seeing the world in black and white... but 8 bits (256 shades between good and bad) could be overkill. So I tried various quantizations, and changing the output and the embed to even Q8 had a more significant impact than quantizing the "inner" tensors more.
If instead of testing them synthetically you chat with them for a while, you will see the difference.
To test my theory from the other side I even did the opposite: I quantized the output and embed heavily and the inner ones lightly.. the result was horrible and it felt like chatting with a brain damaged child.
F16 seems a bit aggressive though, it massively increased the size of the quant, I could consider Q8 though, I imagine that there's hardly a difference between f16 and Q8, especially compared to Q5
I'd also be interested in running some PPL tests.. may be interesting
quick test
Quant - embed/output - PPL
Q8 - default - 6.8556 +/- 0.04371
Q8 - f16 - 6.8543 +/- 0.04370
Q4_K_M - default - 6.8888 +/- 0.04385
Q4_K_M - f16 - 6.8844 +/- 0.04382
considering it's 200-300 MB larger for about 0.004 PPL.. it's hard to be sure if this is worth it, got any more reliable tests..?
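For what it's worth, the gap in the table above can be put next to the reported uncertainty. This is a rough sketch in Python, assuming the +/- figures are standard errors of the PPL estimate (that reading is an assumption, the thread doesn't say). Both runs score the same text, so their errors are correlated; treat this as a sanity check and not a significance test.

```python
# PPL figures copied from the quick test above: (ppl, reported +/-).
results = {
    ("Q8", "default"): (6.8556, 0.04371),
    ("Q8", "f16"): (6.8543, 0.04370),
    ("Q4_K_M", "default"): (6.8888, 0.04385),
    ("Q4_K_M", "f16"): (6.8844, 0.04382),
}


def ppl_gain(quant):
    """Return (PPL improvement from f16 embed/output, +/- of the default run)."""
    default_ppl, default_err = results[(quant, "default")]
    f16_ppl, _ = results[(quant, "f16")]
    return default_ppl - f16_ppl, default_err


for quant in ("Q8", "Q4_K_M"):
    gain, err = ppl_gain(quant)
    print(f"{quant}: gain {gain:.4f} vs +/- {err:.4f} (ratio {gain / err:.2f})")
```

In both rows the gain is a small fraction of the reported +/-, which is why perplexity alone can't settle the question either way.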
Honestly no, but I usually chat with some models (mistral v03 instruct for example) and the difference is huge in both understanding and expressing, considering the slight increase in size. Perhaps it doesn't show up in synthetic tests, no idea. If you try the opposite you will understand how impactful the output/embed tensors are compared to the other ones... I tried quantizing the output/embed to q6_k and keeping the inner ones at f16.. and the difference was huge (way worse and sort of lobotomized).
q4 is quite a stretch... the best results (with mistral 7B v03 instruct, wizardLM-2 7B and Llama-3-8B) were using f16 for output/embed and q6_k or q5_k for the others.
I'd really like it if you had some hard examples, even just a prompt that shows a difference in generation, otherwise I'm not sure I can justify it
I can consider Q8, but f16 is a sizeable increase in final model, to the point where I question the value vs using a bigger quant level
this is definitely worth continuing investigation but I won't - yet - blindly implement it across the board, I do really appreciate the suggestion and bringing it to my attention
This has caught my attention, so I did some comparisons, and the f16 variant has won pretty much every time. Here is a side-by-side comparison of L3-8B-Stheno-v3.2 summarizing Chapter 2032 from the webnovel Overgeared, which is roughly 2700 tokens long:
Q8_0:
It appears that Grid has arrived at the Collapsed Temple, where the demon cultivators are hiding. Marie Rose has used her blood magic to clear a path through the poisonous lake, revealing a massive temple at the end. Noe has also joined Grid, bearing the armor and gear crafted by Khan and Hexetia. The situation is tense as Grid prepares to confront the cultivators, but Marie Rose's actions have given them an advantage.
f16:
The narrative now shifts to Grid and his companions approaching the temple, with Marie Rose having cleared a path through the poisonous lake using her blood magic. Noe, who had been guided by angels, joins them, though he seems disoriented and exhausted from his journey. Grid, meanwhile, remains focused on their mission, dispatching Noe to fight without hesitation. The story hints at the complex world politics and alliances at play, with Asgard and the demon cultivators now on the same side against Grid and the Overgeared Kingdom. The chapter ends with a sense of foreboding and tension as the group prepares to face whatever awaits them within the ancient temple.
The Q8_0 essentially just picked a few sentences and spliced them together, while the f16 actually gave a deeper analysis of the chapter. In my other comparisons, I generally found that the Q8_0 can make obvious connections between things, while the f16 can understand more without being told explicitly. The f16 increased the file size by about 900 MB, which, for me, is definitely worth it. I can't effectively test 70B models on my machine, but the embeddings might be one of the reasons quantization affects smaller models more than larger ones.
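The size increase can be roughly cross-checked with some arithmetic. This is a minimal sketch, and the shapes are assumptions not stated in the thread: a vocabulary of 128,256 and a hidden size of 4,096 (the published Llama-3-8B config), separate embedding and output tensors, and a baseline that stores both at Q8_0.

```python
# Assumed shapes for a Llama-3-8B model (not stated in the thread).
VOCAB_SIZE = 128_256
HIDDEN_SIZE = 4_096

# Bits per weight: F16 stores 16 bits; Q8_0 stores 32 int8 weights plus
# one f16 scale per block, i.e. 34 bytes per 32 weights = 8.5 bits.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5}


def tensor_bytes(n_params, quant):
    """Approximate storage in bytes for n_params weights at a given type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8


# Token embedding and output tensors each hold vocab x hidden weights.
n_params = 2 * VOCAB_SIZE * HIDDEN_SIZE
extra = tensor_bytes(n_params, "f16") - tensor_bytes(n_params, "q8_0")
print(f"extra size for f16 embed/output: {extra / 1e6:.0f} MB")
```

Under those assumptions the estimate lands in the same range as the increase reported above. A model with a smaller vocabulary would see a proportionally smaller increase.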
I wonder if it would be worth considering releasing a few side by side, like a pseudo Q8+ for those who want to push quality even further.. especially for these smaller models it would be almost negligible for compute to make 2 extra sizes, one for Q8 and one for Q6
As I said, my tests are just chatting for a long time with them about any subject that comes to mind or just brainstorming.
If I quantize the embed and output to q6_k (for example) and the others at f16 I get a horrible result. Lobotomized/braindamaged child.
If I do the opposite I get very close to the normal "pure" F16. Sometimes I can't tell the difference.
And the "size" of those 2 experiments is almost the same because the embed+output tensors are as big as the rest.
This tells me that a good quantization is the one I proposed (or variations of it), always keeping in mind that the output tensor influences the "expression": the more it gets quantized, the more the model speaks like a child.
The embed tensor instead seems to influence its understanding.
Probably I am not discovering anything or I am just reinventing the wheel... but I thought it was worth mentioning it.
I usually first convert the HF model to f16. Then I produce my quantizations using this naming template:
model.f16.q6.gguf or f16.q5 or q8.q4 (this gives very bad results but can be useful for simpler tasks)
where f16 represents the embed and output tensors and the other represents the "inner" tensors.
some examples are here: https://huggingface.co/ZeroWw/Test/tree/main
Also, I found that a "pure" Q8_0 performs way worse than an f16/q5, which has practically the same size.
I am curious about your tests with other models...
14,484,731,552 WizardLM-2-7B.fp16.gguf
4,263,540,448 WizardLM-2-7B.fp16.q4.gguf
5,131,409,120 WizardLM-2-7B.fp16.q5.gguf
5,942,064,864 WizardLM-2-7B.fp16.q6.gguf
4,368,438,976 zephyr-7b-beta.Q4_K_M.gguf
14,484,732,192 zephyr-orpo-7b-v0.2.f16.gguf
5,458,065,696 zephyr-orpo-7b-v0.2.f16.q5.gguf
6,251,313,440 zephyr-orpo-7b-v0.2.f16.q6.gguf
in these, the difference between the f16 (14gb) models and the f16/q5 and f16/q6 is minimal.
I start to notice degradation in f16/q4.
I am still studying because in the same way, some inner tensors could be quantized more than others... but that will take long to test...
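To make the size trade-off in the listing above easier to read, here is a small sketch that expresses each mixed quant as a fraction of its pure f16 file. The byte counts are copied from the listing; the dictionary keys are just labels chosen for this example.

```python
# File sizes in bytes, copied from the listing above.
sizes = {
    "WizardLM-2-7B": {
        "f16": 14_484_731_552,
        "f16.q4": 4_263_540_448,
        "f16.q5": 5_131_409_120,
        "f16.q6": 5_942_064_864,
    },
    "zephyr-orpo-7b-v0.2": {
        "f16": 14_484_732_192,
        "f16.q5": 5_458_065_696,
        "f16.q6": 6_251_313_440,
    },
}


def fraction_of_f16(model, variant):
    """Size of a mixed quant relative to the pure f16 file of the same model."""
    return sizes[model][variant] / sizes[model]["f16"]


for model, variants in sizes.items():
    for variant in variants:
        if variant != "f16":
            share = fraction_of_f16(model, variant)
            print(f"{model} {variant}: {share:.1%} of f16")
```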
Yep! Summarizing requires understanding (embed) and then expressing (output). It's probably a good test for this.
The more difficult the subject is, the more reasoning is needed, the more the results will be evident.
I also think that there should be leaderboards organized in this way:
- size of the model. (in bytes)
- reasoning and expressing evaluation.
The second could be measured by having the models summarize varied and complex narratives or scientific papers,
with the strict rule to use original models, not ones finetuned on those tasks.
That would probably prompt people to start optimizing for size and efficiency.
That's just my thought... perhaps naive...
I'll release a couple with the f16 embed and output with the normal ones later today and you can make some side by side comparisons
Another important thing is this: from my tests, if I quantize a model to a pure Q8, it comes out bigger and dumber than an f16/q6 or f16/q5, which on average are 20-30% smaller.
@turboderp do you know if a similar approach has already been tried in exllama2? It seems that exl2 does some calibration to find the best quant per layer etc, maybe you've explored this?
This hasn't been an issue with Phi3 or any other model to my knowledge. All the objective tests I can do show that a quantized head layer works fine for this model (difference compared to FP16 model vanishes completely around 6 bpw). So if it's subjectively dumber somehow, I have no idea why that would be. And I wouldn't know where to begin investigating it without something a little more concrete to go on.
Can't say if there's anything particular about GGUF that causes it to clamp the logits differently when the output layer is FP16, and maybe that has an effect at extreme temperatures or something?
i don't think it's specific to phi3
i also don't know that i believe there is that big a difference, i'd want to see more side-by-side comparisons to confirm any changes in behaviour before i commit to doing it too much
same can be done with exl2 I suppose. can the head bits even go above 8?
Here are the models I quantized so far with that method:
https://huggingface.co/ZeroWw/Samantha-Qwen-2-7B-GGUF
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF
https://huggingface.co/ZeroWw/microsoft_WizardLM-2-7B-GGUF
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/ZeroWw/Mistroll-7B-v2.2-GGUF
Hi Guys,
I'm pretty new to quants and local LLMs in general, but I find it very interesting and am willing to learn from those who know better than me.
Would you mind sharing how you quantize and which quant commands you use and are playing with?
Thank you in advance
I use the quantize command in llama.cpp, but unlike the normal quantizations, which quantize every tensor in the same way, I keep the output and embed tensors at f16 (barely quantized) and quantize all the other tensors to q5_k or q6_k. This gives smaller models with less degradation.
You can find my quants in my profile (models). They are all quantized in this way:
echo Quantizing f16/q5
./build/bin/llama-quantize &>/dev/null --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q5.gguf q5_k $(nproc)
echo Quantizing f16/q6
./build/bin/llama-quantize &>/dev/null --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q6.gguf q6_k $(nproc)
echo Quantizing q8_0
./build/bin/llama-quantize &>/dev/null --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q8.gguf q8_0 $(nproc)
First, you need to convert the Hugging Face model into a GGUF. If the model is in bf16, use this command: python ./convert-hf-to-gguf.py --outtype bf16 {model_directory}
If it's in f16, just drop --outtype bf16. You can find out which type it is in the model's config.json.
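The choice between the two conversion commands can be sketched in Python. This is only an illustration: it assumes config.json stores the type under a "torch_dtype" key with values such as "bfloat16" or "float16" (check your model's file, the key name is an assumption here), and the helper names are made up for this example.

```python
import json
from pathlib import Path


def outtype_args(config):
    """Return extra convert-script arguments for a parsed config.json dict."""
    # Assumption: the dtype lives under "torch_dtype" ("bfloat16"/"float16").
    if config.get("torch_dtype") == "bfloat16":
        return ["--outtype", "bf16"]
    # f16 case: no --outtype flag, as described above.
    return []


def convert_command(model_dir):
    """Build the conversion command for a local Hugging Face model directory."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return [
        "python",
        "./convert-hf-to-gguf.py",
        *outtype_args(config),
        str(model_dir),
    ]
```

The returned list can be passed to subprocess.run, or joined with spaces to get the same command line as above.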
Once you have the f16 or bf16 GGUF, you can use llama-quantize to quantize the model. Here's the command you would use to keep the embeddings and output tensors at f16: llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 {input_model_name}.gguf {output_model_name}.gguf {quantization_level}
If you don't want to keep the embeddings and output tensors at f16, just remove --allow-requantize --output-tensor-type f16 --token-embedding-type f16. There are many levels of quantization to choose from: Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, Q2_K. Of these, Q8_0 is the highest quality and Q2_K is the lowest.
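If you want to produce several quants in one go, the command above can be wrapped in a small script. This is a sketch: the flags and the model.f16.q5.gguf naming are taken from this thread, while the function names, the default binary name and the run switch are made up for the example.

```python
import subprocess


def quantize_command(src, dst, inner_type, outer_type="f16",
                     binary="llama-quantize"):
    """Build the llama-quantize argument list used in this thread."""
    return [
        binary,
        "--allow-requantize",
        "--output-tensor-type", outer_type,
        "--token-embedding-type", outer_type,
        src,
        dst,
        inner_type,
    ]


def quantize_all(model_name, inner_types=("q5_k", "q6_k", "q8_0"), run=False):
    """Build (and optionally run) one command per inner tensor type."""
    commands = []
    for inner in inner_types:
        tag = inner.split("_")[0]  # q5_k -> q5, q8_0 -> q8
        dst = f"{model_name}.f16.{tag}.gguf"
        cmd = quantize_command(f"{model_name}.f16.gguf", dst, inner)
        if run:
            # Requires the binary to be on PATH; raises if it exits non-zero.
            subprocess.run(cmd, check=True)
        commands.append(cmd)
    return commands


for cmd in quantize_all("model"):
    print(" ".join(cmd))
```

With run=False it only prints the command lines, so you can check the output file names before anything is written.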
Or just download the models already quantized by me.. I made quite a few and you'll find them in my profile.
ALL the models were quantized in this way:
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q5.gguf q5_k
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q6.gguf q6_k
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q8.gguf q8_0
and there is also a pure f16 in every directory.
- ZeroWw/Llama-3-8B-Instruct-Gradient-1048k-GGUF
- ZeroWw/Pythia-Chat-Base-7B-GGUF
- ZeroWw/Yi-1.5-6B-Chat-GGUF
- ZeroWw/DeepSeek-Coder-V2-Lite-Base-GGUF
- ZeroWw/Yi-1.5-9B-32K-GGUF
- ZeroWw/aya-23-8B-GGUF
- ZeroWw/MixTAO-7Bx2-MoE-v8.1-GGUF
- ZeroWw/Phi-3-medium-128k-instruct-GGUF
- ZeroWw/Phi-3-mini-128k-instruct-GGUF
- ZeroWw/Qwen1.5-7B-Chat-GGUF
- ZeroWw/NeuralDaredevil-8B-abliterated-GGUF
- ZeroWw/Mistroll-7B-v2.2-GGUF
- ZeroWw/Samantha-Qwen-2-7B-GGUF
- ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
- ZeroWw/NSFW_DPO_Noromaid-7b-Mistral-7B-Instruct-v0.1-GGUF
- ZeroWw/microsoft_WizardLM-2-7B-GGUF
- ZeroWw/Mistral-7B-Instruct-v0.3-GGUF