alexmarques committed
Commit f25d875 · verified · 1 Parent(s): c5e2332

Update README.md

Files changed (1)
  1. README.md +9 -7
README.md CHANGED
@@ -30,7 +30,8 @@ This model was obtained by quantizing the weights of [gemma-2-2b-it](https://hug
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 
 Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating point representations of the quantized weights.
-The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.
+The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+GPTQ used a 1% damping factor and 256 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
 
 ## Deployment
 
@@ -70,10 +71,9 @@ This model was created by using the [llm-compressor](https://github.com/vllm-pro
 
 ```python
 from transformers import AutoTokenizer
-from datasets import Dataset
+from datasets import load_dataset
 from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
 from llmcompressor.modifiers.quantization import GPTQModifier
-import random
 
 model_id = "google/gemma-2-2b-it"
 
@@ -82,10 +82,12 @@ max_seq_len = 8192
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-max_token_id = len(tokenizer.get_vocab()) - 1
-input_ids = [[random.randint(0, max_token_id) for _ in range(max_seq_len)] for _ in range(num_samples)]
-attention_mask = num_samples * [max_seq_len * [1]]
-ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})
+def preprocess_fn(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
+
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.shuffle().select(range(num_samples))
+ds = ds.map(preprocess_fn)
 
 recipe = GPTQModifier(
     targets="Linear",
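
The diff is cut off at the start of the recipe definition. For orientation only, here is a hedged sketch of how the remaining recipe and one-shot call could continue, based on the parameters the card mentions (1% damping factor, 256 calibration samples, 8,192-token sequence length) and reusing the names defined in the snippet above; the quantization scheme, ignore list, and output directory are assumptions, not taken from this commit.

```python
# Hedged sketch of how the truncated snippet might continue; the scheme,
# ignore list, and save path are assumptions, not part of this commit.
recipe = GPTQModifier(
    targets="Linear",          # quantize only the Linear operators in the transformer blocks
    scheme="W8A16",            # INT8 weights, 16-bit activations (assumed)
    ignore=["lm_head"],        # leave the output head unquantized (assumed)
    dampening_frac=0.01,       # the 1% damping factor described in the card
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# Run one-shot GPTQ calibration over the preprocessed dataset
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save the compressed checkpoint (directory name is illustrative)
model.save_pretrained("gemma-2-2b-it-quantized.w8a16", save_compressed=True)
tokenizer.save_pretrained("gemma-2-2b-it-quantized.w8a16")
```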
 
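For readers unfamiliar with the scheme described in the card text, here is a minimal, self-contained illustration (my own sketch, not llm-compressor's implementation) of symmetric per-channel INT8 weight quantization, where a single floating-point scale per output channel maps between the INT8 and floating point representations.

```python
# Minimal sketch of symmetric per-channel INT8 weight quantization:
# one scale per output channel (row of the weight matrix).
import torch

def quantize_per_channel_symmetric(weight: torch.Tensor):
    # weight has shape [out_features, in_features]; one scale per output row
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(weight / scales).clamp(-127, 127).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor):
    # the linear per-channel scaling maps INT8 back to floating point
    return q.to(torch.float32) * scales

w = torch.randn(4, 8)
q, s = quantize_per_channel_symmetric(w)
print((dequantize(q, s) - w).abs().max())  # small per-channel reconstruction error
```

Storing the INT8 tensor plus one floating-point scale per output channel, instead of 16-bit weights, is what yields the roughly 50% reduction in disk size and GPU memory noted in the card.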