---
license: llama3.1
datasets:
- nvidia/OpenMathInstruct-2
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
model-index:
- name: Control-LLM-Llama3.1-8B-Math16
results:
- task:
type: math-evaluation
dataset:
type: parquet
name: Math, Math Hard, GSM8K
dataset_kwargs:
data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
metrics:
- name: exact_match,none
type: exact_match
value: 0.6327358367133324
stderr: 0.0052245703347459605
verified: false
- name: exact_match,none (gsm8k_0shot_instruct)
type: exact_match
value: 0.9052312357846853
stderr: 0.008067791560015407
verified: false
- name: exact_match,none (meta_math_0shot_instruct)
type: exact_match
value: 0.6276
stderr: 0.006837616441401548
verified: false
- name: exact_match,none (meta_math_hard_0shot_instruct)
type: exact_match
value: 0.3806646525679758
stderr: 0.013349170720370741
verified: false
- task:
type: original-capability
dataset:
type: meta/Llama-3.1-8B-Instruct-evals
name: Llama-3.1-8B-Instruct-evals Dataset
        dataset_path: "meta-llama/Llama-3.1-8B-Instruct-evals"
dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
metrics:
- name: exact_match,strict-match
type: exact_match
value: 0.5723263625528227
stderr: 0.002858377993520894
verified: false
- name: exact_match,strict-match (meta_arc_0shot_instruct)
type: exact_match
value: 0.7974248927038626
stderr: 0.01178043813618557
verified: false
- name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
type: exact_match
value: 0.25223214285714285
stderr: 0.02054139101648797
verified: false
- name: exact_match,strict-match (meta_mmlu_0shot_instruct)
type: exact_match
value: 0.6837345107534539
stderr: 0.0039243761987253515
verified: false
- name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
type: exact_match
value: 0.4324301861702128
stderr: 0.004516653585262379
verified: false
pipeline_tag: text-generation
library_name: transformers
---
# Control-LLM-Llama3.1-8B-Math16
This model is fine-tuned from Llama-3.1-8B-Instruct for mathematical tasks on the OpenMath2 dataset, as described in the paper [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).
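The model can be used as a standard `transformers` text-generation checkpoint. Below is a minimal inference sketch; the repository id is assumed from the model name above and may need to be adjusted to the actual hosted path.

```python
# Minimal inference sketch (the repo id below is an assumption; verify it on the Hub).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama-3.1-Instruct checkpoints expect the chat template.
messages = [
    {"role": "user", "content": "What is the sum of the first 100 positive integers?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```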
## Linked Paper
This model is associated with the paper: [Control-LLM](https://arxiv.org/abs/2501.10979).
## Linked Open-Source Code - Training, Evaluation, and Benchmarks
This model is associated with the GitHub repository: [Control-LLM](https://github.com/linkedin/ControlLLM).
## Evaluation Results
Here is an overview of the evaluation results and findings:
### Benchmark Result and Catastrophic Forgetting on OpenMath
The following plot illustrates the benchmark results and the mitigation of catastrophic forgetting on the OpenMath2 dataset.
![Catastrophic Forgetting](plots/catastrophic_forgetting_openmath.png)
### Alignment Comparison
The plot below compares the alignment of the model trained with Control LLM against full-parameter tuning.
![Alignment Comparison](plots/alignment_comparison.png)
### Benchmark Results Table
The table below summarizes evaluation results across mathematical tasks and original capabilities.
| **Model** | **MH** | **M** | **G8K** | **M-Avg** | **ARC** | **GPQA** | **MLU** | **MLUP** | **O-Avg** | **Overall** |
|-------------------|--------|--------|---------|-----------|---------|----------|---------|----------|-----------|-------------|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1 | 83.4 | 29.9 | 72.4 | 46.7 | 60.5 | 56.3 |
| OpenMath2-Llama3 | 38.4 | 64.1 | 90.3 | 64.3 | 45.8 | 1.3 | 4.5 | 19.5 | 12.9 | 38.6 |
| **Full Tune** | **38.5**| **63.7**| 90.2 | **63.9** | 58.2 | 1.1 | 7.3 | 23.5 | 16.5 | 40.1 |
| Partial Tune | 36.4 | 61.4 | 89.0 | 61.8 | 66.2 | 6.0 | 25.7 | 30.9 | 29.3 | 45.6 |
| Stack Exp. | 35.6 | 61.0 | 90.8 | 61.8 | 69.3 | 18.8 | 61.8 | 43.1 | 53.3 | 57.6 |
| Hybrid Exp. | 34.4 | 61.1 | 90.1 | 61.5 | **81.8**| **25.9** | 67.2 | **43.9** | 57.1 | 59.3 |
| **Control LLM*** | 38.1 | 62.7 | **90.4**| 63.2 | 79.7 | 25.2 | **68.1**| 43.6 | **57.2** | **60.2** |
---
### Explanation:
- **MH**: MathHard
- **M**: Math
- **G8K**: GSM8K
- **M-Avg**: Math - Average across MathHard, Math, and GSM8K
- **ARC**: ARC benchmark
- **GPQA**: Graduate-level, Google-proof question answering
- **MLU**: MMLU (Massive Multitask Language Understanding)
- **MLUP**: MMLU Pro
- **O-Avg**: Original Capability - Average across ARC, GPQA, MMLU, and MMLU Pro
- **Overall**: Combined average across all tasks
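The math metrics reported in the model-index above follow the lm-evaluation-harness conventions. The sketch below is a hypothetical reproduction path, assuming the custom task definitions under `src/controlllm/inference/llm_eval_harness/additional_tasks/` in the ControlLLM repository have been registered with the harness; the task names are taken from the metric labels above, and the repo id is an assumption.

```python
# Hypothetical evaluation sketch with lm-evaluation-harness (pip install lm-eval).
# Assumes the ControlLLM additional task configs are registered with the harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-Math16,dtype=bfloat16",  # assumed repo id
    tasks=[
        "gsm8k_0shot_instruct",
        "meta_math_0shot_instruct",
        "meta_math_hard_0shot_instruct",
    ],
    batch_size=8,
)

# Each task reports exact_match under the "exact_match,none" key.
for task, metrics in results["results"].items():
    print(task, metrics.get("exact_match,none"))
```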