alexmarques committed
Commit f25d875 · verified · 1 Parent(s): c5e2332

Update README.md

Files changed (1)
  1. README.md +9 -7
README.md CHANGED
@@ -30,7 +30,8 @@ This model was obtained by quantizing the weights of [gemma-2-2b-it](https://hug
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 
 Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating point representations of the quantized weights.
-The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.
+The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+GPTQ used a 1% damping factor and 256 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
 
 ## Deployment
 
@@ -70,10 +71,9 @@ This model was created by using the [llm-compressor](https://github.com/vllm-pro
 
 ```python
 from transformers import AutoTokenizer
-from datasets import Dataset
+from datasets import load_dataset
 from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-import random
 
 model_id = "google/gemma-2-2b-it"
 
@@ -82,10 +82,12 @@ max_seq_len = 8192
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-max_token_id = len(tokenizer.get_vocab()) - 1
-input_ids = [[random.randint(0, max_token_id) for _ in range(max_seq_len)] for _ in range(num_samples)]
-attention_mask = num_samples * [max_seq_len * [1]]
-ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})
+def preprocess_fn(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
+
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.shuffle().select(range(num_samples))
+ds = ds.map(preprocess_fn)
 
 recipe = GPTQModifier(
     targets="Linear",
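
The diff is cut off at the start of the recipe definition. For orientation only, here is a hedged sketch of how the remaining recipe and one-shot call could continue, based on the parameters the card mentions (1% damping factor, 256 calibration samples, 8,192-token sequence length) and reusing the names defined in the snippet above; the quantization scheme, ignore list, and output directory are assumptions, not taken from this commit.

```python
# Hedged sketch of how the truncated snippet might continue; the scheme,
# ignore list, and save path are assumptions, not part of this commit.
recipe = GPTQModifier(
    targets="Linear",          # quantize only the Linear operators in the transformer blocks
    scheme="W8A16",            # INT8 weights, 16-bit activations (assumed)
    ignore=["lm_head"],        # leave the output head unquantized (assumed)
    dampening_frac=0.01,       # the 1% damping factor described in the card
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# Run one-shot GPTQ calibration over the preprocessed dataset
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save the compressed checkpoint (directory name is illustrative)
model.save_pretrained("gemma-2-2b-it-quantized.w8a16", save_compressed=True)
tokenizer.save_pretrained("gemma-2-2b-it-quantized.w8a16")
```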
 
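For readers unfamiliar with the scheme described in the card text, here is a minimal, self-contained illustration (my own sketch, not llm-compressor's implementation) of symmetric per-channel INT8 weight quantization, where a single floating-point scale per output channel maps between the INT8 and floating point representations.

```python
# Minimal sketch of symmetric per-channel INT8 weight quantization:
# one scale per output channel (row of the weight matrix).
import torch

def quantize_per_channel_symmetric(weight: torch.Tensor):
    # weight has shape [out_features, in_features]; one scale per output row
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(weight / scales).clamp(-127, 127).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor):
    # the linear per-channel scaling maps INT8 back to floating point
    return q.to(torch.float32) * scales

w = torch.randn(4, 8)
q, s = quantize_per_channel_symmetric(w)
print((dequantize(q, s) - w).abs().max())  # small per-channel reconstruction error
```

Storing the INT8 tensor plus one floating-point scale per output channel, instead of 16-bit weights, is what yields the roughly 50% reduction in disk size and GPU memory noted in the card.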