vicuna-13B-1.1-GPTQ-4bit-128g not working properly

#8
by AnzacExodus - opened

Hi, I'm currently trying to get the vicuna-13B-1.1-GPTQ-4bit-128g model working on my PC. I'm on Windows 10 and I'm using Ooba (text-generation-webui). I've downloaded the repository, including the no-act-order version, as I have an NVIDIA GPU (RTX 3070), but when I type in the chat, my message shows for about a second and then disappears. The only model that seems to work for me is the "TheBloke_vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g" version.
I'm new to all the AI stuff, so if anyone can point me in the right direction, I'd appreciate it.

Thank you.

Same here. After building the latest GPTQ-for-LLaMa that works on Windows and using the latest ooba, I get this exception when I type something (at the moment the prompt disappears):

GPTQ-for-LLaMa\quant.py", line 279, in forward
quant_cuda.vecquant4matmul(x.float(), self.qweight, out, self.scales.float(), self.qzeros, self.g_idx)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None

Invoked with: tensor([[ 0.0097, -0.0423, 0.2747, ..., -0.0144, 0.0021, 0.0083],
[ 0.0172, 0.0039, -0.0247, ..., -0.0062, -0.0020, -0.0056],
[ 0.0144, 0.0142, -0.0514, ..., 0.0037, 0.0072, 0.0195],

(a bunch of tensor stuff)

device='cuda:0', dtype=torch.int32), tensor([38, 6, 1, ..., 23, 9, 17], device='cuda:0', dtype=torch.int32)

That's the exception I get. It happens both for safetensors and the pt one. Basically it loads the model fine, but throws the exception when trying any input.
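
For anyone reading the traceback: the compiled kernel lists arg5 as an int, while the call in quant.py passes a tensor (self.g_idx) in that position. Below is a minimal sketch for spotting that kind of mismatch. It assumes the pybind11-style signature text shown in the error, and find_mismatches is a hypothetical helper, not part of GPTQ-for-LLaMa.

```python
def parse_supported(signature):
    """Return the declared parameter types from a pybind11-style signature line."""
    inner = signature[signature.index("(") + 1 : signature.rindex(")")]
    return [part.split(":", 1)[1].strip() for part in inner.split(",")]


def find_mismatches(signature, passed_types):
    """List (position, expected, got) for every argument whose type differs."""
    expected = parse_supported(signature)
    return [
        (i, exp, got)
        for i, (exp, got) in enumerate(zip(expected, passed_types))
        if exp != got
    ]


# Signature copied from the error message above.
SIG = (
    "(arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, "
    "arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None"
)
# Assumption: all six values in the failing call were tensors, as the
# "Invoked with" dump suggests.
PASSED = ["torch.Tensor"] * 6

print(find_mismatches(SIG, PASSED))
```

Only the last argument differs, which points at a compiled extension and a quant.py that come from different versions.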

What you describe matches exactly, so I believe it's the same issue. I tried building GPTQ-for-LLaMa from an earlier commit (about a month old), but then it fails because another function is called with the wrong number of arguments.

I don't remember having any better luck running the .pt version with the "installer" version of ooba, but I'm going to try and see if it does any better.

I got the .pt to work with the latest installer of ooba, but sadly it doesn't allow using pre_layers, so I run out of VRAM pretty quickly. I can't make it work with the latest CUDA branch of GPTQ-for-LLaMa either, so I'm pretty much stuck.
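
A rough back-of-the-envelope estimate shows why VRAM is tight here. The sketch below only counts quantized weights and assumes one 16-bit scale and one 4-bit zero point per group of 128 weights. That storage layout is an assumption for illustration, and the estimate ignores unquantized layers, activations, and the context cache.

```python
def quantized_weight_gib(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Approximate size of quantized weights in GiB (weights only)."""
    # Per-group metadata is spread across the weights in the group.
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 2**30


# Roughly 13 billion parameters for a 13B model.
estimate = quantized_weight_gib(13e9)
print(f"about {estimate:.1f} GiB for weights alone")
```

An RTX 3070 has 8 GB of VRAM, so the weights alone take most of it before any text is generated, which is why offloading some layers with pre_layers matters on this card.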

You get this error if you haven't recompiled the CUDA branch of GPTQ-for-LLaMa after git cloning it. The installed quant_cuda extension then no longer matches the arguments that quant.py passes. You need to run:

cd GPTQ-for-LLaMa
pip uninstall quant-cuda
python setup_cuda.py install

To compile it, you will need a C/C++ compiler and the CUDA toolkit installed. This may be tricky to do on Windows, but check out this post from another user who did it recently: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/discussions/9#644d73b80dc952d245a53624
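
Before running the build, a quick check of what is on PATH can save a failed compile. This is a minimal sketch using only the standard library. The tool names are assumptions about a typical setup: nvcc for the CUDA toolkit, cl for MSVC on Windows, gcc and g++ on Linux or WSL2.

```python
import shutil


def find_build_tools(candidates=("nvcc", "cl", "gcc", "g++")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in candidates}


for name, path in find_build_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

On Windows, cl is usually only on PATH inside a Visual Studio developer command prompt, so a "not found" there may just mean the wrong shell.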

Note that life is a lot easier in WSL2. Then you can use the Triton branch.

Does it work well? It looks like some kind of virtualization, so I'm not sure whether performance will be degraded or whether I can even use my NVIDIA GPU. But if you recommend it, I will give it a try.

Yes, it's basically virtualised Linux on Windows. It performs very well. You can definitely use an NVIDIA GPU in WSL, and NVIDIA has a guide to doing it: https://docs.nvidia.com/cuda/wsl-user-guide/index.html
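
Once WSL2 is set up, one way to confirm the GPU is visible from inside it is to query nvidia-smi. A minimal sketch, assuming nvidia-smi is on PATH inside the WSL2 distribution; list_gpus is a hypothetical helper name.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' line per visible GPU, or [] if none."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # driver tools not installed or not on PATH
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU visible from this environment")
```

An empty result usually means the driver or WSL2 GPU support still needs setting up, which the linked guide covers.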

Thanks!
It worked perfectly. It's a lifesaver, because lately most 4-bit quantized models weren't working with ooba, and although I knew that running it on Linux might solve it, I honestly didn't want to set up a dual boot (I'm a bit old-fashioned, so that's the first thing that springs to mind).

Great! Glad you got it working

It takes a very long time to load in a Kaggle notebook. Is there any way to load the whole model faster?

Are you definitely loading it on the GPU? What is the GPU?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
It is a Tesla P100 with 16 GB of VRAM.
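
To answer the "is it on the GPU" question from output like the above, the memory column is the thing to read. The sketch below pulls used and total MiB out of nvidia-smi text. The sample line is copied from the output above, and parse_memory_usage is a hypothetical helper.

```python
import re

# GPU status row copied from the nvidia-smi output above.
SAMPLE = "| N/A 34C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |"


def parse_memory_usage(text):
    """Return a list of (used_mib, total_mib) pairs found in nvidia-smi text."""
    pattern = r"(\d+)MiB\s*/\s*(\d+)MiB"
    return [(int(used), int(total)) for used, total in re.findall(pattern, text)]


for used, total in parse_memory_usage(SAMPLE):
    print(f"{used} MiB used of {total} MiB")
```

Here the snapshot shows 0 MiB in use, so nothing was resident on the GPU when it was taken. Running nvidia-smi again while the model is loading would show whether the weights actually end up in VRAM.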

What version of GPTQ-for-LLaMa are you using, do you know?

I am using the latest version.

You can try adding these command-line arguments to text-generation-webui: --quant_attn --fused_mlp. They may provide some speed boost.
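
As a sketch of where those two flags go, the snippet below assembles a launch command. Everything except --quant_attn and --fused_mlp is an assumption for illustration: the server.py entry point, the --model, --wbits and --groupsize options, and the model folder name should all be checked against your own text-generation-webui install.

```python
import shlex


def build_launch_command(model_dir, extra_flags=("--quant_attn", "--fused_mlp")):
    """Build an argument list for launching the web UI with extra flags appended."""
    # Assumed base invocation for a 4-bit, group size 128 GPTQ model.
    base = ["python", "server.py", "--model", model_dir, "--wbits", "4", "--groupsize", "128"]
    return base + list(extra_flags)


# Hypothetical folder name, following the naming pattern used earlier in the thread.
cmd = build_launch_command("TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g")
print(shlex.join(cmd))
```

Keeping the flags in a list makes it easy to drop one of them and compare load and generation times.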
