The `main` branch for TheBloke/Llama-2-70B-GPTQ appears borked

#3
by Aivean - opened

Using the latest oobabooga/text-generation-webui on RunPod. Tried two different GPUs (L40 48GB and A100 80GB) with the ExLlama loader.

The model loads successfully (nothing in the logs), but fails during inference:

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/text_generation.py", line 331, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "/workspace/text-generation-webui/modules/exllama.py", line 98, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 186, in gen_begin_reuse
    self.gen_begin(in_tokens)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 171, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 849, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 930, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 470, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 388, in forward
    key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 525, 64, 128]' is invalid for input of size 537600

Interestingly enough, a very small prompt (like 'Hello') works.

Tried other loaders, similar issues. Tried Llama 2 13b, and it worked.

Tried the gptq-4bit-64g-actorder_True quantization on the A100, same error. All settings are default. My steps are literally: start pod, download model, load it, try to generate.

Same error here on an A100 80GB.

There's an architecture change with 70B.
ExLLaMA and AutoGPTQ issue.

Do you mean there is a difference between 13b and 70b (former works fine)?

In this case usage instructions and compatibility info should be updated:
https://huggingface.co/TheBloke/Llama-2-70B-GPTQ#how-to-easily-download-and-use-this-model-in-text-generation-webui

Same issue on 2xA6000.

This is because the number of key/value heads in attention for Llama 70B is different from num_attention_heads (you can check this in the config.json of the model uploaded by Meta). That's why transformers has a new function named repeat_kv to accommodate this. ExLlama and AutoGPTQ haven't implemented it yet.
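For anyone who wants to sanity-check this against the traceback, here is a rough sketch of the arithmetic. The numbers come straight from the RuntimeError above; the helper `kv_heads_from_config` and the small `config_70b` dict are made up for illustration (only the two relevant keys are shown, and the fallback to `num_attention_heads` is an assumption about how older configs without the key should be read).

```python
# Numbers copied from the RuntimeError in the traceback above.
bsz, q_len, num_attention_heads, head_dim = 1, 525, 64, 128
actual_elements = 537600

# Element count that the failing .view() call expects.
expected_elements = bsz * q_len * num_attention_heads * head_dim

# Number of key/value heads the tensor actually holds.
actual_kv_heads = actual_elements // (bsz * q_len * head_dim)

print(expected_elements)  # 4300800
print(actual_kv_heads)    # 8


def kv_heads_from_config(config: dict) -> int:
    """Hypothetical helper: read the key/value head count from a parsed
    config.json, assuming it equals num_attention_heads when the key is absent."""
    return config.get("num_key_value_heads", config["num_attention_heads"])


# Illustrative subset of a 70B config.json (only the two relevant keys).
config_70b = {"num_attention_heads": 64, "num_key_value_heads": 8}
print(kv_heads_from_config(config_70b))  # 8
```

So the key tensor has 8 heads' worth of data, while the loader tries to view it as 64 heads, which is why the reshape fails.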

Same here on an A100 80gb.

Yes, you need to update Transformers to the latest version. I should have mentioned that in the README, but it was already 4am and I forgot.

Please run:

pip3 install git+https://github.com/huggingface/transformers

and try again.

There is an architectural change to 70b, yes. They added grouped-query attention which needs to be added to ExLlama. It's not a big change, though, and I'm on it, so be patient. Downloading all these models takes a while. And yes, 7b and 13b don't have this change.
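To make the grouped-query attention idea concrete, here is a minimal pure-Python sketch of how query heads share key/value heads. It is not ExLlama's or transformers' actual code: both function names are hypothetical, the heads are plain list items instead of tensors, and the 64/8 head counts are assumed to be the 70B values.

```python
def kv_head_for_query_head(q_head: int, num_attention_heads: int,
                           num_key_value_heads: int) -> int:
    """Return the index of the key/value head that a query head reads from."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into key/value groups")
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


def repeat_kv_heads(kv_heads: list, n_rep: int) -> list:
    """Repeat each key/value head n_rep times in a row, so the result lines up
    one-to-one with the query heads."""
    return [head for head in kv_heads for _ in range(n_rep)]


# Assumed 70B values: 64 query heads sharing 8 key/value heads.
n_rep = 64 // 8
expanded = repeat_kv_heads([f"kv{i}" for i in range(8)], n_rep)

print(len(expanded))                      # 64
print(expanded[7], expanded[8])           # kv0 kv1
print(kv_head_for_query_head(63, 64, 8))  # 7
```

With 7b and 13b the two head counts are equal, so every query head has its own key/value head and no repetition is needed.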

Great, looking forward to it! GfL and AutoGPTQ are slow as shit with this ;)

These turnaround times are amazing, guys. It looks like Llama 2 support was added to ExLlama. What's that, 24 hours since the OG model dropped?

Awesome! Can confirm that after updating text-generation-webui and updating pip deps, ExLlama loader worked! Thanks, everyone!
