|
--- |
|
license: other |
|
inference: true |
|
tags: |
|
- vicuna |
|
--- |
|
![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png) |
|
|
|
This is an 8-bit GPTQ (not to be confused with 8-bit RTN) version of Vicuna 13B v1.1 HF.
|
|
|
Q. Why quantize in 8-bit instead of 4-bit?
|
A. For evaluation purposes. In theory, an 8-bit quantized model should provide slightly better perplexity (perhaps not noticeably so - to be evaluated) than a 4-bit quantized version. If you have more than 15GB of GPU VRAM available, you may want to try this out.
|
Note that 8-bit GPTQ quantization is not the same as loading the model in 8-bit precision. Loading a model in 8-bit precision (`--load-in-8bit`) comes with noticeable quality (perplexity) degradation.
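
To illustrate the difference, here is a minimal sketch of how each would be loaded with oobabooga/text-generation-webui (flag names follow the webui's CLI at the time of writing and may have changed since; model directory names are illustrative):

```
# Pre-quantized 8-bit GPTQ weights (this model), loaded through the GPTQ loader:
python server.py --model Vicuna13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama

# On-the-fly 8-bit precision via bitsandbytes (NOT what this model provides; noticeably worse perplexity):
python server.py --model vicuna-13B-v1.1-HF --load-in-8bit
```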
|
|
|
Refs: |
|
- https://github.com/ggerganov/llama.cpp/pull/951 |
|
- https://news.ycombinator.com/item?id=35148542 |
|
- https://github.com/ggerganov/llama.cpp/issues/53 |
|
- https://arxiv.org/abs/2210.17323 |
|
- https://arxiv.org/abs/2105.03536 |
|
- https://arxiv.org/abs/2212.09720 |
|
- https://arxiv.org/abs/2301.00774 |
|
- https://github.com/IST-DASLab/gptq |
|
|
|
<br> |
|
|
|
**This model is an 8-bit quantization of Vicuna 13B v1.1.**
|
- 13B parameters |
|
- Group size: 128 |
|
- wbits: 8 |
|
- true-sequential: yes |
|
- act-order: yes |
|
- 8-bit GPTQ |
|
- Calibration dataset: c4
|
- Conversion process: LLaMa 13B -> LLaMa 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g |
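
For reference, these settings correspond to a GPTQ-for-LLaMa invocation roughly like the sketch below; the exact command used for this checkpoint is an assumption, and the input/output paths are illustrative:

```
# Hedged sketch - flags mirror the parameters listed above; paths and filenames are illustrative.
python llama.py ./Vicuna13B-v1.1-HF c4 \
  --wbits 8 \
  --groupsize 128 \
  --true-sequential \
  --act-order \
  --save vicuna13B-v1.1-8bit-128g.pt
```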
|
|
|
<br> |
|
<br> |
|
|
|
# Benchmarks |
|
|
|
Benchmarks were run with https://github.com/qwopqwop200/GPTQ-for-LLaMa/. Best results are in **bold**.
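
The numbers below should be reproducible with commands along these lines (a sketch against GPTQ-for-LLaMa's `llama.py` as it stood at the time; the checkpoint path and filename are illustrative):

```
MODEL_DIR=./models/Vicuna13B-v1.1-8bit-128g              # illustrative path
CKPT=$MODEL_DIR/vicuna13B-v1.1-8bit-128g.safetensors     # illustrative filename

# Benchmark over 2048 tokens, reporting perplexity and VRAM usage:
python llama.py $MODEL_DIR c4 --wbits 8 --groupsize 128 --load $CKPT --benchmark 2048 --check

# Perplexity on wikitext2 / ptb / c4:
python llama.py $MODEL_DIR c4 --wbits 8 --groupsize 128 --load $CKPT --eval

# Perplexity with the updated ptb-new / c4-new splits:
python llama.py $MODEL_DIR c4 --wbits 8 --groupsize 128 --load $CKPT --new-eval --eval
```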
|
|
|
*`--benchmark 2048 --check` results:* |
|
| Model | wikitext2 PPL | ptb PPL | c4 PPL | VRAM Utilization (MiB) |
|
|---|---|---|---|---| |
|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 8.517391204833984 | 20.888103485107422 | **7.058407783508301** | **8670.26953125** | |
|
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g | **8.508771896362305** | **20.75649070739746** | 7.105874538421631 | 14840.26171875 | |
|
|
|
*`--eval` results:* |
|
| Model | wikitext2 PPL | ptb PPL | c4 PPL | |
|
|---|---|---|---| |
|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 7.119165420532227 | 25.692861557006836 | 9.06746768951416 | |
|
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g | **6.988043308258057** | **24.882535934448242** | **8.991846084594727** | |
|
|
|
*`--new-eval --eval` results:* |
|
| Model | wikitext2 PPL | ptb-new PPL | c4-new PPL | |
|
|---|---|---|---| |
|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 7.119165420532227 | 35.637290954589844 | 9.550592422485352 | |
|
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g | **6.988043308258057** | **34.264320373535156** | **9.426002502441406** | |
|
|
|
PPL = Perplexity (lower is better) - https://huggingface.co/docs/transformers/perplexity |
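
In other words, perplexity is the exponential of the average negative log-likelihood per token, PPL = exp(-(1/N) * sum_i log p(token_i | tokens_<i)), so even small absolute differences reflect a consistent gap in how well the model predicts the evaluation text.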
|
|
|
<br> |
|
<br> |
|
|
|
# Basic installation procedure |
|
|
|
- It was a nightmare, so I will only briefly detail what you'll need. WSL was quite painful to sort out.
|
- I will not provide installation support, sorry. |
|
- You can certainly use llama.cpp or other loaders that support 8-bit quantization; I just chose oobabooga/text-generation-webui.
|
- You will likely face many bugs before text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/reinstall packages.
|
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes. |
|
- If this model produces very slow answers (1 token/s), it means you are not using CUDA for bitsandbytes or that your hardware needs an upgrade (see the quick CUDA check after this list).

- If this model produces answers with weird characters, it means you are using a broken commit of qwopqwop200/GPTQ-for-LLaMa.

- If this model produces answers that are off topic or if it talks to itself, it means you are using a broken commit of qwopqwop200/GPTQ-for-LLaMa.
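
A quick, hedged way to confirm that PyTorch (and therefore bitsandbytes) can actually see CUDA:

```
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```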
|
|
|
RECOMMENDED - Triton (fast tokens/s) - works on Windows with WSL (what I used) or Linux:
|
``` |
|
git clone https://github.com/oobabooga/text-generation-webui |
|
cd text-generation-webui |
|
# The two lines below were only needed before Triton support was merged - https://github.com/oobabooga/text-generation-webui/pull/1229

#git fetch origin pull/1229/head:triton

#git checkout triton
|
pip install -r requirements.txt |
|
|
|
mkdir repositories |
|
cd repositories |
|
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # default (triton) branch - do not add -b cuda here
|
cd GPTQ-for-LLaMa |
|
#git checkout 508de42 # No longer needed - the regression has since been fixed - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
|
git checkout 210c379 # Optional - a commit I have verified; try the latest commit first and fall back to an older one such as this if it misbehaves
|
pip install -r requirements.txt |
|
``` |
|
|
|
DISCOURAGED - CUDA (slow tokens/s, known output issues - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/128):
|
``` |
|
git clone https://github.com/oobabooga/text-generation-webui |
|
cd text-generation-webui |
|
pip install -r requirements.txt |
|
|
|
mkdir repositories |
|
cd repositories |
|
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you use the qwopqwop200 fork, not the oobabooga one, because this model was quantized with act-order
|
cd GPTQ-for-LLaMa |
|
git checkout 505c2c7 # Optional - a commit I have verified; try the latest commit first and fall back to an older one such as this if it misbehaves
|
pip install -r requirements.txt |
|
python setup_cuda.py install |
|
``` |
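
Whichever route you take, you still need to download this checkpoint into the webui's `models` folder and point the loader at it. A hedged sketch (requires git-lfs; flag names follow the webui's CLI at the time and may have changed):

```
# From the text-generation-webui folder:
git lfs install
git clone https://huggingface.co/Thireus/Vicuna13B-v1.1-8bit-128g models/Vicuna13B-v1.1-8bit-128g

# Launch in chat mode with GPTQ loader settings matching this model:
python server.py --model Vicuna13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama --chat
```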
|
|
|
<br> |
|
<br> |
|
|
|
# Testbench detail and demo |
|
|
|
- Latest version of oobabooga/text-generation-webui + https://github.com/oobabooga/text-generation-webui/pull/1229
|
|
|
- NVIDIA RTX 3090

- 32GB DDR4

- i9-7980XE OC @ 4.6GHz
|
|
|
- 11 tokens/s on average with Triton |
|
- Tokens/s roughly equivalent to what was observed with the 4-bit version
|
- Pending preliminary observation: better quality results than 8-bit RTN / `--load-in-8bit` (to be confirmed)
|
- Observation: better quality results than 4-bit GPTQ (cf. the PPL benchmarks above) at the cost of +71% VRAM usage.
|
- Tested and working in both chat mode and text generation mode |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png) |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_02.png) |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_03.png) |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_04.png) |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_05.png) |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_06.png) |
|
|
|
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_07.png) |
|
|
|
<br> |
|
<br> |
|
|
|
# License |
|
|
|
Research only - non-commercial research purposes; other restrictions apply. See the LICENSE file inherited from LLaMA.
|
|
|
LLaMA-13B converted to work with Transformers/HuggingFace is under a special license, please see the LICENSE file for details. |
|
|
|
https://www.reddit.com/r/LocalLLaMA/comments/12kl68j/comment/jg31ufe/ |
|
|
|
<br> |
|
<br> |
|
|
|
# Vicuna Model Card |
|
|
|
## Model details |
|
|
|
**Model type:** |
|
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. |
|
It is an auto-regressive language model, based on the transformer architecture. |
|
|
|
**Model date:** |
|
Vicuna was trained between March 2023 and April 2023. |
|
|
|
**Organizations developing the model:** |
|
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego. |
|
|
|
**Paper or resources for more information:** |
|
https://vicuna.lmsys.org/ |
|
|
|
**License:** |
|
Apache License 2.0 |
|
|
|
**Where to send questions or comments about the model:** |
|
https://github.com/lm-sys/FastChat/issues |
|
|
|
## Intended use |
|
**Primary intended uses:** |
|
The primary use of Vicuna is research on large language models and chatbots. |
|
|
|
**Primary intended users:** |
|
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence. |
|
|
|
## Training dataset |
|
70K conversations collected from ShareGPT.com. |
|
|
|
## Evaluation dataset |
|
A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details. |
|
|
|
## Major updates of weights v1.1 |
|
- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries (see the example conversation below).
|
- Fix the supervised fine-tuning loss computation for better model quality. |
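
For reference, a Vicuna v1.1 conversation rendered with this scheme looks roughly like the example below (based on FastChat's conversation template; treat the exact wording as an assumption and check the FastChat repository for the authoritative format):

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hello! ASSISTANT: Hi! How can I help you today?</s>USER: What is GPTQ? ASSISTANT:
```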