ekurtic (nm-research) committed
Commit 9cb69ae · verified · 1 Parent(s): fbf18c2

Add OpenLLM Leaderboard V1 and V2 evals (#1)


- Add OpenLLM Leaderboard V1 and V2 evals (632548784324a4fe188f974229181291dfb9bdca)


Co-authored-by: Neural Magic Research <[email protected]>

Files changed (1)
1. README.md (+55 -33)
README.md CHANGED
@@ -1,14 +1,16 @@
---
license: apache-2.0
- datasets:
- - openai/gsm8k
language:
- en
tags:
- - nvidia
- mistral-small
- fp8
- vllm
---

# Mistral-Small-24B-Instruct-2501-FP8-Dynamic
@@ -25,12 +27,12 @@ tags:
- **Model Developers:** Neural Magic

Quantized version of [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501).
- It achieves a flexible-extract filter score of 0.9030 on the evaluated on [GSM8k](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) task, where as the unquantized model achieves a flexible-extract filter score of 0.9060.

### Model Optimizations

- This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM >= 0.5.2.
- This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized.

## Deployment

@@ -43,7 +45,7 @@ from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
- model_name = "nm-testing/Mistral-Small-24B-Instruct-2501-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
@@ -64,7 +66,7 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

## Creation

- This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.


```python
@@ -77,7 +79,7 @@ import os
def main():
parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
parser.add_argument('--model_id', type=str, required=True,
- help='The model ID from HuggingFace (e.g., "mistralai/Mistral-Small-24B-Instruct-2501")')
parser.add_argument('--save_path', type=str, default='.',
help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
args = parser.parse_args()
@@ -110,41 +112,61 @@ if __name__ == "__main__":

## Evaluation

- The optimized model was evaluated on GSM8k task with the flexible-extract filter score of 0.9030 ± 0.0082, and strict-match filter score of 0.8976 ± 0.0083, where as the unquantized model with the flexible-extract filter score of 0.9060 ± .0080, and strict-match filter score of 0.8992 ± 0.0083.
-
- Evaluations were carried out using the following commands.

- For the quantized model:
```
lm_eval \
--model vllm \
- --model_args pretrained="nm-testing/Mistral-Small-24B-Instruct-2501-FP8-Dynamic",add_bos_token=True \
- --tasks gsm8k \
- --batch_size auto
```

- For the unquantized model
```
lm_eval \
--model vllm \
- --model_args pretrained="mistralai/Mistral-Small-24B-Instruct-2501",add_bos_token=True \
- --tasks gsm8k \
- --batch_size auto

```

### Accuracy

- #### GSM8k evaluation scores for the optimized model
-
- |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
- |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
- |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9030|± |0.0082|
- | | |strict-match | 5|exact_match|↑ |0.8976|± |0.0083|
-
- #### GSM8k evaluation scores for the unquantized model
-
- |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
- |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
- |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9060|± |0.0080|
- | | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
+
+
---
license: apache-2.0
language:
- en
tags:
+ - mistral
- mistral-small
- fp8
- vllm
+ base_model: mistralai/Mistral-Small-24B-Instruct-2501
+ library_name: transformers
---

# Mistral-Small-24B-Instruct-2501-FP8-Dynamic
 
- **Model Developers:** Neural Magic

Quantized version of [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501).
+ It achieves an average score of 78.88 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 79.45.

### Model Optimizations

+ This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM.
+ This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized.
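A rough back-of-the-envelope check of the ~50% figure (an illustrative sketch, not from the original card; it assumes all ~24B parameters are quantized, whereas embeddings and the lm_head actually remain at 16-bit, so real savings are slightly lower):

```python
# Illustrative weight-memory estimate for a 24B-parameter model.
# Assumption: every parameter is quantized (in practice embeddings/lm_head stay at 16-bit).
num_params = 24e9
bf16_gb = num_params * 2 / 1e9  # 2 bytes per parameter at 16-bit -> ~48 GB
fp8_gb = num_params * 1 / 1e9   # 1 byte per parameter at FP8     -> ~24 GB
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of original)")
```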

## Deployment

from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
+ model_name = "neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
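The hunk above only shows the lines around the updated checkpoint name; the rest of the deployment snippet is not included in this diff. A hedged sketch of how such a snippet typically continues with vLLM's chat-template flow (the prompt and exact structure are assumptions, not the card's code):

```python
# Hypothetical continuation of the deployment snippet above (not shown in this diff):
# format a chat prompt with the model's template and generate with vLLM.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```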
 

## Creation

+ This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.


```python
 
def main():
parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
parser.add_argument('--model_id', type=str, required=True,
+ help='The model ID from HuggingFace (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
parser.add_argument('--save_path', type=str, default='.',
help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
args = parser.parse_args()
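The diff surfaces only the argument-parsing part of the creation script; the quantization step itself is elided. As a hedged sketch of what an FP8-dynamic quantization with llm-compressor generally looks like (imports, recipe, and save path here are assumptions, not the script's exact code):

```python
# Illustrative sketch of FP8-dynamic quantization with llm-compressor
# (assumed recipe; the original script's exact body is not shown in this diff).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC: weights are quantized statically, activations at runtime;
# lm_head is left in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = model_id.split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```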
 

## Evaluation

+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), using the following commands:

+ OpenLLM Leaderboard V1:
```
lm_eval \
--model vllm \
+ --model_args pretrained="neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks openllm \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config
```

+ OpenLLM Leaderboard V2:
```
lm_eval \
--model vllm \
+ --model_args pretrained="neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --apply_chat_template \
+ --fewshot_as_multiturn \
+ --tasks leaderboard \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config

```

### Accuracy

+ #### OpenLLM Leaderboard V1 evaluation scores
+
+ | Metric | mistralai/Mistral-Small-24B-Instruct-2501 | nm-testing/Mistral-Small-24B-Instruct-2501-FP8-dynamic |
+ |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
+ | ARC-Challenge (Acc-Norm, 25-shot) | 72.18 | 71.76 |
+ | GSM8K (Strict-Match, 5-shot) | 90.14 | 89.01 |
+ | HellaSwag (Acc-Norm, 10-shot) | 85.05 | 84.65 |
+ | MMLU (Acc, 5-shot) | 80.69 | 80.55 |
+ | TruthfulQA (MC2, 0-shot) | 65.55 | 64.85 |
+ | Winogrande (Acc, 5-shot) | 83.11 | 82.48 |
+ | **Average Score** | **79.45** | **78.88** |
+ | **Recovery (%)** | **100.00** | **99.28** |
+
+ #### OpenLLM Leaderboard V2 evaluation scores
+
+ | Metric | mistralai/Mistral-Small-24B-Instruct-2501 | nm-testing/Mistral-Small-24B-Instruct-2501-FP8-dynamic |
+ |---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
+ | IFEval (Inst-and-Prompt Level Strict Acc, 0-shot) | 73.27 | 73.53 |
+ | BBH (Acc-Norm, 3-shot) | 45.18 | 44.39 |
+ | MMLU-Pro (Acc, 5-shot) | 38.83 | 37.28 |
+ | **Average Score** | **52.42** | **51.73** |
+ | **Recovery (%)** | **100.00** | **98.68** |
+ | Math-Hard (Exact-Match, 4-shot) | 6.35 | 2.99 |
+ | GPQA (Acc-Norm, 0-shot) | 8.29 | 6.97 |
+ | MUSR (Acc-Norm, 0-shot) | 7.84 | 8.04 |
+
+ Results on Math-Hard, GPQA, and MUSR are not considered in the accuracy recovery calculation because the unquantized model scores close to random on these tasks (6.35, 8.29, 7.84), which does not provide a reliable baseline for recovery.
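For reference, the reported averages and recovery figures follow directly from the per-task scores; a small sanity check using the V2 values from the table above (V1 works the same way):

```python
# Recompute the V2 average and recovery from the table above
# (Math-Hard, GPQA, and MUSR are excluded, as explained in the note).
baseline = {"IFEval": 73.27, "BBH": 45.18, "MMLU-Pro": 38.83}
quantized = {"IFEval": 73.53, "BBH": 44.39, "MMLU-Pro": 37.28}

base_avg = sum(baseline.values()) / len(baseline)     # ~52.4 (Average Score row)
quant_avg = sum(quantized.values()) / len(quantized)  # ~51.7
recovery = 100 * quant_avg / base_avg                 # ~98.7 (Recovery row)

print(f"baseline: {base_avg:.2f}  quantized: {quant_avg:.2f}  recovery: {recovery:.2f}%")
```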