---
license: llama3.1
datasets:
  - nvidia/OpenMathInstruct-2
language:
  - en
metrics:
  - accuracy
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
model-index:
  - name: Control-LLM-Llama3.1-8B-Math16
    results:
      - task:
          type: math-evaluation
        dataset:
          type: parquet
          name: Math, Math Hard, GSM8K
          dataset_kwargs:
            data_files: >-
              https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet
        metrics:
          - name: exact_match,none
            type: exact_match
            value: 0.6327358367133324
            stderr: 0.0052245703347459605
            verified: false
          - name: exact_match,none (gsm8k_0shot_instruct)
            type: exact_match
            value: 0.9052312357846853
            stderr: 0.008067791560015407
            verified: false
          - name: exact_match,none (meta_math_0shot_instruct)
            type: exact_match
            value: 0.6276
            stderr: 0.006837616441401548
            verified: false
          - name: exact_match,none (meta_math_hard_0shot_instruct)
            type: exact_match
            value: 0.3806646525679758
            stderr: 0.013349170720370741
            verified: false
      - task:
          type: original-capability
        dataset:
          type: meta/Llama-3.1-8B-Instruct-evals
          name: Llama-3.1-8B-Instruct-evals Dataset
          dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
          dataset_name: Llama-3.1-8B-Instruct-evals__arc_challenge__details
        metrics:
          - name: exact_match,strict-match
            type: exact_match
            value: 0.5723263625528227
            stderr: 0.002858377993520894
            verified: false
          - name: exact_match,strict-match (meta_arc_0shot_instruct)
            type: exact_match
            value: 0.7974248927038626
            stderr: 0.01178043813618557
            verified: false
          - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
            type: exact_match
            value: 0.25223214285714285
            stderr: 0.02054139101648797
            verified: false
          - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
            type: exact_match
            value: 0.6837345107534539
            stderr: 0.0039243761987253515
            verified: false
          - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
            type: exact_match
            value: 0.4324301861702128
            stderr: 0.004516653585262379
            verified: false
pipeline_tag: text-generation
library_name: transformers
---

# Control-LLM-Llama3.1-8B-Math16

This model is fine-tuned from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) for mathematical tasks on the [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) (OpenMath2) dataset, as described in the paper *Control LLM: Controlled Evolution for Intelligence Retention in LLM*.
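
For quick testing, here is a minimal inference sketch using the `transformers` text-generation pipeline. The hub repo id below is an assumption based on this card, not a confirmed path; replace it with the actual model location.

```python
# Minimal inference sketch; the repo id is assumed from this card, not confirmed.
import torch
from transformers import pipeline

model_id = "hawei/Control-LLM-Llama3.1-8B-Math16"  # assumed hub repo id

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The pipeline applies the Llama-3.1 chat template to message lists.
messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 20."}]
output = pipe(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```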

## Linked Paper

This model is associated with the paper *Control LLM: Controlled Evolution for Intelligence Retention in LLM*.

## Linked Open Source Code - Training, Eval, and Benchmark

Training, evaluation, and benchmark code for this model is available in the GitHub repository: [linkedin/ControlLLM](https://github.com/linkedin/ControlLLM).
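
As a rough sketch of how the math numbers could be reproduced with the repo's custom `lm-evaluation-harness` tasks: the task names below are taken from this card's metrics metadata, while the include path and hub repo id are assumptions; consult the ControlLLM repository for the exact commands.

```python
# Hedged sketch of reproducing the math evals with lm-evaluation-harness (>= 0.4).
# Task names come from this card's metrics; the include path assumes a local
# clone of https://github.com/linkedin/ControlLLM, and the hub repo id is assumed.
import lm_eval
from lm_eval.tasks import TaskManager

task_manager = TaskManager(
    include_path="ControlLLM/src/controlllm/inference/llm_eval_harness/additional_tasks"
)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=hawei/Control-LLM-Llama3.1-8B-Math16,dtype=bfloat16",
    tasks=[
        "gsm8k_0shot_instruct",
        "meta_math_0shot_instruct",
        "meta_math_hard_0shot_instruct",
    ],
    batch_size=8,
    task_manager=task_manager,
)
print(results["results"])
```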

## Evaluation Results

Here is an overview of the evaluation results and findings:

### Benchmark Results and Catastrophic Forgetting on OpenMath

The following plot illustrates the benchmark results and the mitigation of catastrophic forgetting on the OpenMath2 dataset.

*(Figure: Catastrophic Forgetting)*

### Alignment Comparison

The plot below compares the alignment of the model trained with Control LLM against full-parameter tuning.

*(Figure: Alignment Comparison)*

### Benchmark Results Table

The table below summarizes evaluation results across mathematical tasks and original capabilities.

| Model            |   MH |    M |  G8K | M-Avg |  ARC | GPQA |  MLU | MLUP | O-Avg | Overall |
|------------------|-----:|-----:|-----:|------:|-----:|-----:|-----:|-----:|------:|--------:|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 |  52.1 | 83.4 | 29.9 | 72.4 | 46.7 |  60.5 |    56.3 |
| OpenMath2-Llama3 | 38.4 | 64.1 | 90.3 |  64.3 | 45.8 |  1.3 |  4.5 | 19.5 |  12.9 |    38.6 |
| Full Tune        | 38.5 | 63.7 | 90.2 |  63.9 | 58.2 |  1.1 |  7.3 | 23.5 |  16.5 |    40.1 |
| Partial Tune     | 36.4 | 61.4 | 89.0 |  61.8 | 66.2 |  6.0 | 25.7 | 30.9 |  29.3 |    45.6 |
| Stack Exp.       | 35.6 | 61.0 | 90.8 |  61.8 | 69.3 | 18.8 | 61.8 | 43.1 |  53.3 |    57.6 |
| Hybrid Exp.      | 34.4 | 61.1 | 90.1 |  61.5 | 81.8 | 25.9 | 67.2 | 43.9 |  57.1 |    59.3 |
| Control LLM*     | 38.1 | 62.7 | 90.4 |  63.2 | 79.7 | 25.2 | 68.1 | 43.6 |  57.2 |    60.2 |

Explanation:

- MH: Math Hard
- M: Math
- G8K: GSM8K
- M-Avg: Math average across Math Hard, Math, and GSM8K
- ARC: ARC Challenge benchmark
- GPQA: Graduate-level, Google-proof QA benchmark
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU-Pro
- O-Avg: Original-capability average across ARC, GPQA, MMLU, and MMLU-Pro
- Overall: Combined score, the mean of M-Avg and O-Avg (see the sanity check below)
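
As a sanity check on the aggregation, assuming Overall is the plain mean of M-Avg and O-Avg (this reproduces the table's Overall column to within rounding):

```python
# Recompute the Overall column from M-Avg and O-Avg for a few rows of the table.
rows = {
    "Llama3.1-8B-Inst": (52.1, 60.5),
    "OpenMath2-Llama3": (64.3, 12.9),
    "Control LLM*": (63.2, 57.2),
}
for model, (m_avg, o_avg) in rows.items():
    print(f"{model}: Overall = {(m_avg + o_avg) / 2:.1f}")
# -> 56.3, 38.6, 60.2, matching the table.
```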