---
license: llama3.1
datasets:
- nvidia/OpenMathInstruct-2
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
model-index:
- name: Control-LLM-Llama3.1-8B-Math16
results:
- task:
type: math-evaluation
dataset:
type: parquet
name: Math, Math Hard, GSM8K
dataset_kwargs:
data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
metrics:
- name: exact_match,none
type: exact_match
value: 0.6327358367133324
stderr: 0.0052245703347459605
verified: false
- name: exact_match,none (gsm8k_0shot_instruct)
type: exact_match
value: 0.9052312357846853
stderr: 0.008067791560015407
verified: false
- name: exact_match,none (meta_math_0shot_instruct)
type: exact_match
value: 0.6276
stderr: 0.006837616441401548
verified: false
- name: exact_match,none (meta_math_hard_0shot_instruct)
type: exact_match
value: 0.3806646525679758
stderr: 0.013349170720370741
verified: false
- task:
type: original-capability
dataset:
type: meta/Llama-3.1-8B-Instruct-evals
name: Llama-3.1-8B-Instruct-evals Dataset
        dataset_path: "meta-llama/Llama-3.1-8B-Instruct-evals"
dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
metrics:
- name: exact_match,strict-match
type: exact_match
value: 0.5723263625528227
stderr: 0.002858377993520894
verified: false
- name: exact_match,strict-match (meta_arc_0shot_instruct)
type: exact_match
value: 0.7974248927038626
stderr: 0.01178043813618557
verified: false
- name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
type: exact_match
value: 0.25223214285714285
stderr: 0.02054139101648797
verified: false
- name: exact_match,strict-match (meta_mmlu_0shot_instruct)
type: exact_match
value: 0.6837345107534539
stderr: 0.0039243761987253515
verified: false
- name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
type: exact_match
value: 0.4324301861702128
stderr: 0.004516653585262379
verified: false
pipeline_tag: text-generation
library_name: transformers
---
# Control-LLM-Llama3.1-8B-Math16
This model is fine-tuned from Llama-3.1-8B-Instruct for mathematical tasks on the OpenMath2 dataset, as described in the paper [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).
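The model can be used as a standard `transformers` text-generation checkpoint. Below is a minimal inference sketch; the repository id is assumed from the model name above and may need to be adjusted to the actual hosted path.

```python
# Minimal inference sketch (the repo id below is an assumption; verify it on the Hub).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama-3.1-Instruct checkpoints expect the chat template.
messages = [
    {"role": "user", "content": "What is the sum of the first 100 positive integers?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```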
## Linked Paper
This model is associated with the paper: [Control-LLM](https://arxiv.org/abs/2501.10979).
## Linked Open-Source Code - Training, Evaluation, and Benchmarks
This model is associated with the GitHub repository: [Control-LLM](https://github.com/linkedin/ControlLLM).
## Evaluation Results
Here is an overview of the evaluation results and findings:
### Benchmark Result and Catastrophic Forgetting on OpenMath
The following plot illustrates the benchmark results and the mitigation of catastrophic forgetting on the OpenMath2 dataset.
![Catastrophic Forgetting](plots/catastrophic_forgetting_openmath.png)
### Alignment Comparison
The plot below compares the alignment of the model trained with Control LLM against full-parameter tuning.
![Alignment Comparison](plots/alignment_comparison.png)
### Benchmark Results Table
The table below summarizes evaluation results across mathematical tasks and original capabilities.
| **Model** | **MH** | **M** | **G8K** | **M-Avg** | **ARC** | **GPQA** | **MLU** | **MLUP** | **O-Avg** | **Overall** |
|-------------------|--------|--------|---------|-----------|---------|----------|---------|----------|-----------|-------------|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1 | 83.4 | 29.9 | 72.4 | 46.7 | 60.5 | 56.3 |
| OpenMath2-Llama3 | 38.4 | 64.1 | 90.3 | 64.3 | 45.8 | 1.3 | 4.5 | 19.5 | 12.9 | 38.6 |
| **Full Tune** | **38.5**| **63.7**| 90.2 | **63.9** | 58.2 | 1.1 | 7.3 | 23.5 | 16.5 | 40.1 |
| Partial Tune | 36.4 | 61.4 | 89.0 | 61.8 | 66.2 | 6.0 | 25.7 | 30.9 | 29.3 | 45.6 |
| Stack Exp. | 35.6 | 61.0 | 90.8 | 61.8 | 69.3 | 18.8 | 61.8 | 43.1 | 53.3 | 57.6 |
| Hybrid Exp. | 34.4 | 61.1 | 90.1 | 61.5 | **81.8**| **25.9** | 67.2 | **43.9** | 57.1 | 59.3 |
| **Control LLM*** | 38.1 | 62.7 | **90.4**| 63.2 | 79.7 | 25.2 | **68.1**| 43.6 | **57.2** | **60.2** |
---
### Explanation:
- **MH**: MathHard
- **M**: Math
- **G8K**: GSM8K
- **M-Avg**: Math - Average across MathHard, Math, and GSM8K
- **ARC**: ARC benchmark
- **GPQA**: Graduate-level, Google-proof question answering
- **MLU**: MMLU (Massive Multitask Language Understanding)
- **MLUP**: MMLU Pro
- **O-Avg**: Original Capability - Average across ARC, GPQA, MMLU, and MMLU Pro
- **Overall**: Combined average across all tasks
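The math metrics reported in the model-index above follow the lm-evaluation-harness conventions. The sketch below is a hypothetical reproduction path, assuming the custom task definitions under `src/controlllm/inference/llm_eval_harness/additional_tasks/` in the ControlLLM repository have been registered with the harness; the task names are taken from the metric labels above, and the repo id is an assumption.

```python
# Hypothetical evaluation sketch with lm-evaluation-harness (pip install lm-eval).
# Assumes the ControlLLM additional task configs are registered with the harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-Math16,dtype=bfloat16",  # assumed repo id
    tasks=[
        "gsm8k_0shot_instruct",
        "meta_math_0shot_instruct",
        "meta_math_hard_0shot_instruct",
    ],
    batch_size=8,
)

# Each task reports exact_match under the "exact_match,none" key.
for task, metrics in results["results"].items():
    print(task, metrics.get("exact_match,none"))
```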