ekurtic (nm-research) committed
Commit 9cb69ae · verified · 1 Parent(s): fbf18c2

Add OpenLLM Leaderboard V1 and V2 evals (#1)


- Add OpenLLM Leaderboard V1 and V2 evals (632548784324a4fe188f974229181291dfb9bdca)


Co-authored-by: Neural Magic Research <[email protected]>

Files changed (1)
1. README.md (+55 -33)
README.md CHANGED
@@ -1,14 +1,16 @@
---
license: apache-2.0
- datasets:
- - openai/gsm8k
language:
- en
tags:
- - nvidia
- mistral-small
- fp8
- vllm
---

# Mistral-Small-24B-Instruct-2501-FP8-Dynamic
@@ -25,12 +27,12 @@ tags:
- **Model Developers:** Neural Magic

Quantized version of [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501).
- It achieves a flexible-extract filter score of 0.9030 on the evaluated on [GSM8k](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) task, where as the unquantized model achieves a flexible-extract filter score of 0.9060.

### Model Optimizations

- This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM >= 0.5.2.
- This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized.

## Deployment

@@ -43,7 +45,7 @@ from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
- model_name = "nm-testing/Mistral-Small-24B-Instruct-2501-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
@@ -64,7 +66,7 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

## Creation

- This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.


```python
@@ -77,7 +79,7 @@ import os
def main():
parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
parser.add_argument('--model_id', type=str, required=True,
- help='The model ID from HuggingFace (e.g., "mistralai/Mistral-Small-24B-Instruct-2501")')
parser.add_argument('--save_path', type=str, default='.',
help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
args = parser.parse_args()
@@ -110,41 +112,61 @@ if __name__ == "__main__":

## Evaluation

- The optimized model was evaluated on GSM8k task with the flexible-extract filter score of 0.9030 ± 0.0082, and strict-match filter score of 0.8976 ± 0.0083, where as the unquantized model with the flexible-extract filter score of 0.9060 ± .0080, and strict-match filter score of 0.8992 ± 0.0083.
-
- Evaluations were carried out using the following commands.

- For the quantized model:
```
lm_eval \
--model vllm \
- --model_args pretrained="nm-testing/Mistral-Small-24B-Instruct-2501-FP8-Dynamic",add_bos_token=True \
- --tasks gsm8k \
- --batch_size auto
```

- For the unquantized model
```
lm_eval \
--model vllm \
- --model_args pretrained="mistralai/Mistral-Small-24B-Instruct-2501",add_bos_token=True \
- --tasks gsm8k \
- --batch_size auto

```

### Accuracy

- #### GSM8k evaluation scores for the optimized model
-
- |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
- |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
- |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9030|± |0.0082|
- | | |strict-match | 5|exact_match|↑ |0.8976|± |0.0083|
-
- #### GSM8k evaluation scores for the unquantized model
-
- |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
- |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
- |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9060|± |0.0080|
- | | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
+
+
---
license: apache-2.0
language:
- en
tags:
+ - mistral
- mistral-small
- fp8
- vllm
+ base_model: mistralai/Mistral-Small-24B-Instruct-2501
+ library_name: transformers
---

# Mistral-Small-24B-Instruct-2501-FP8-Dynamic
 
- **Model Developers:** Neural Magic

Quantized version of [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501).
+ It achieves an average score of 78.88 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 79.45.

### Model Optimizations

+ This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM.
+ This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized.
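A rough back-of-the-envelope check of the ~50% figure (an illustrative sketch, not from the original card; it assumes all ~24B parameters are quantized, whereas embeddings and the lm_head actually remain at 16-bit, so real savings are slightly lower):

```python
# Illustrative weight-memory estimate for a 24B-parameter model.
# Assumption: every parameter is quantized (in practice embeddings/lm_head stay at 16-bit).
num_params = 24e9
bf16_gb = num_params * 2 / 1e9  # 2 bytes per parameter at 16-bit -> ~48 GB
fp8_gb = num_params * 1 / 1e9   # 1 byte per parameter at FP8     -> ~24 GB
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB "
      f"({fp8_gb / bf16_gb:.0%} of original)")
```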

## Deployment

from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
+ model_name = "neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
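The hunk above only shows the lines around the updated checkpoint name; the rest of the deployment snippet is not included in this diff. A hedged sketch of how such a snippet typically continues with vLLM's chat-template flow (the prompt and exact structure are assumptions, not the card's code):

```python
# Hypothetical continuation of the deployment snippet above (not shown in this diff):
# format a chat prompt with the model's template and generate with vLLM.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```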
 

## Creation

+ This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.


```python
 
def main():
parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
parser.add_argument('--model_id', type=str, required=True,
+ help='The model ID from HuggingFace (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
parser.add_argument('--save_path', type=str, default='.',
help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
args = parser.parse_args()
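The diff surfaces only the argument-parsing part of the creation script; the quantization step itself is elided. As a hedged sketch of what an FP8-dynamic quantization with llm-compressor generally looks like (imports, recipe, and save path here are assumptions, not the script's exact code):

```python
# Illustrative sketch of FP8-dynamic quantization with llm-compressor
# (assumed recipe; the original script's exact body is not shown in this diff).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8_DYNAMIC: weights are quantized statically, activations at runtime;
# lm_head is left in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = model_id.split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```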
 

## Evaluation

+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), using the following commands:

+ OpenLLM Leaderboard V1:
```
lm_eval \
--model vllm \
+ --model_args pretrained="neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks openllm \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config
```

+ OpenLLM Leaderboard V2:
```
lm_eval \
--model vllm \
+ --model_args pretrained="neuralmagic/Mistral-Small-24B-Instruct-2501-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --apply_chat_template \
+ --fewshot_as_multiturn \
+ --tasks leaderboard \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config

```

### Accuracy

+ #### OpenLLM Leaderboard V1 evaluation scores
+
+ | Metric | mistralai/Mistral-Small-24B-Instruct-2501 | nm-testing/Mistral-Small-24B-Instruct-2501-FP8-dynamic |
+ |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
+ | ARC-Challenge (Acc-Norm, 25-shot) | 72.18 | 71.76 |
+ | GSM8K (Strict-Match, 5-shot) | 90.14 | 89.01 |
+ | HellaSwag (Acc-Norm, 10-shot) | 85.05 | 84.65 |
+ | MMLU (Acc, 5-shot) | 80.69 | 80.55 |
+ | TruthfulQA (MC2, 0-shot) | 65.55 | 64.85 |
+ | Winogrande (Acc, 5-shot) | 83.11 | 82.48 |
+ | **Average Score** | **79.45** | **78.88** |
+ | **Recovery (%)** | **100.00** | **99.28** |
+
+ #### OpenLLM Leaderboard V2 evaluation scores
+
+ | Metric | mistralai/Mistral-Small-24B-Instruct-2501 | nm-testing/Mistral-Small-24B-Instruct-2501-FP8-dynamic |
+ |---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
+ | IFEval (Inst-and-Prompt Level Strict Acc, 0-shot) | 73.27 | 73.53 |
+ | BBH (Acc-Norm, 3-shot) | 45.18 | 44.39 |
+ | MMLU-Pro (Acc, 5-shot) | 38.83 | 37.28 |
+ | **Average Score** | **52.42** | **51.73** |
+ | **Recovery (%)** | **100.00** | **98.68** |
+ | Math-Hard (Exact-Match, 4-shot) | 6.35 | 2.99 |
+ | GPQA (Acc-Norm, 0-shot) | 8.29 | 6.97 |
+ | MUSR (Acc-Norm, 0-shot) | 7.84 | 8.04 |
+
+ Results on Math-Hard, GPQA, and MUSR are not considered in the accuracy recovery calculation because the unquantized model scores close to random on these tasks (6.35, 8.29, 7.84), which does not provide a reliable baseline for recovery.
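For reference, the reported averages and recovery figures follow directly from the per-task scores; a small sanity check using the V2 values from the table above (V1 works the same way):

```python
# Recompute the V2 average and recovery from the table above
# (Math-Hard, GPQA, and MUSR are excluded, as explained in the note).
baseline = {"IFEval": 73.27, "BBH": 45.18, "MMLU-Pro": 38.83}
quantized = {"IFEval": 73.53, "BBH": 44.39, "MMLU-Pro": 37.28}

base_avg = sum(baseline.values()) / len(baseline)     # ~52.4 (Average Score row)
quant_avg = sum(quantized.values()) / len(quantized)  # ~51.7
recovery = 100 * quant_avg / base_avg                 # ~98.7 (Recovery row)

print(f"baseline: {base_avg:.2f}  quantized: {quant_avg:.2f}  recovery: {recovery:.2f}%")
```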