---
license: llama3.1
datasets:
  - OpenCoder-LLM/opc-sft-stage1
  - OpenCoder-LLM/opc-sft-stage2
language:
  - en
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
model-index:
  - name: Control-LLM-Llama3.1-8B-OpenCoder8
    results:
      - task:
          type: code-evaluation
        dataset:
          type: mixed
          name: Code Evaluation Dataset
        metrics:
          - name: pass_at_1,n=1 (code_instruct)
            type: pass_at_1
            value: 0.770508826583593
            stderr: 0.013547264970313243
            verified: false
          - name: pass_at_1,n=1 (humaneval_greedy_instruct)
            type: pass_at_1
            value: 0.823170731707317
            stderr: 0.029883277857485988
            verified: false
          - name: pass_at_1,n=1 (humaneval_plus_greedy_instruct)
            type: pass_at_1
            value: 0.7621951219512195
            stderr: 0.033346454086653404
            verified: false
          - name: pass_at_1,n=1 (mbpp_plus_0shot_instruct)
            type: pass_at_1
            value: 0.7751322751322751
            stderr: 0.02150209607822914
            verified: false
          - name: pass_at_1,n=1 (mbpp_sanitized_0shot_instruct)
            type: pass_at_1
            value: 0.7354085603112841
            stderr: 0.027569713464529938
            verified: false
      - task:
          type: original-capability
        dataset:
          type: meta/Llama-3.1-8B-Instruct-evals
          name: Llama-3.1-8B-Instruct-evals Dataset
          dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
          dataset_name: Llama-3.1-8B-Instruct-evals__arc_challenge__details
        metrics:
          - name: exact_match,strict-match (original_capability_instruct)
            type: exact_match
            value: 0.5599378769819771
            stderr: 0.0028491774433443513
            verified: false
          - name: exact_match,strict-match (meta_arc_0shot_instruct)
            type: exact_match
            value: 0.8094420600858369
            stderr: 0.011511446994122106
            verified: false
          - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
            type: exact_match
            value: 0.32589285714285715
            stderr: 0.02216910313464341
            verified: false
          - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
            type: exact_match
            value: 0.681241988320752
            stderr: 0.003932622311434926
            verified: false
          - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
            type: exact_match
            value: 0.4029255319148936
            stderr: 0.004471732136513382
            verified: false
pipeline_tag: text-generation
library_name: transformers
---

# Control-LLM-Llama3.1-8B-OpenCoder8

This is a fine-tuned version of Llama-3.1-8B-Instruct for coding tasks, trained on the OpenCoder SFT datasets and described in the paper *Control LLM: Controlled Evolution for Intelligence Retention in LLM*.

Code: https://github.com/linkedin/ControlLLM.

## Linked open-source code: training, evaluation, and benchmarks

This model is associated with the GitHub repository Control-LLM.
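
The model can be loaded with the transformers library. Below is a minimal usage sketch; the repository id and prompt are illustrative assumptions, not values taken from this card.

```python
# Minimal usage sketch. The model id below is an assumption based on this card's
# title; adjust it to the actual repository path if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-OpenCoder8"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Chat-style prompt, since the base model is Llama-3.1-8B-Instruct.
messages = [
    {"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```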

## Evaluation Results

Here is an overview of the evaluation results and findings:

### Hybrid Expansion on OpenCoder

The following diagram illustrates how hybrid expansion works.

### Catastrophic Forgetting

### Benchmark Results Table

The table below summarizes evaluation results across coding tasks and original capabilities.

| Model | MB+ | MS | HE+ | HE | C-Avg | ARC | GP | MLU | MLUP | O-Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Ins | 70.4 | 67.7 | 66.5 | 70.7 | 69.1 | 83.4 | 29.9 | 72.4 | 46.7 | 60.5 | 64.8 |
| OpenCoder-8B-Ins | 81.2 | 76.3 | 78.0 | 82.3 | 79.5 | 8.2 | 25.4 | 37.4 | 11.3 | 24.6 | 52.1 |
| Full Param Tune | 75.1 | 69.6 | 71.3 | 76.8 | 73.3 | 24.4 | 21.9 | 43.0 | 19.2 | 31.5 | 52.4 |
| Partial Param Tune | 75.7 | 71.6 | 74.4 | 79.3 | 75.0 | 70.2 | 28.1 | 60.7 | 32.4 | 48.3 | 61.7 |
| Stack Expansion | 77.2 | 72.8 | 73.2 | 78.7 | 75.6 | 80.0 | 26.3 | 66.6 | 38.2 | 54.2 | 64.9 |
| ControlLLM-Hybrid | 77.5 | 73.5 | 76.2 | 82.3 | 77.1 | 80.9 | 32.6 | 68.1 | 40.3 | 56.0 | 66.6 |

Explanation of column abbreviations:

- MB+: MBPP Plus
- MS: MBPP Sanitized
- HE+: HumanEval Plus
- HE: HumanEval
- C-Avg: Coding - size-weighted average across MB+, MS, HE+, and HE (see the sketch after this list)
- ARC: ARC benchmark
- GP: GPQA benchmark
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU Pro
- O-Avg: Original capability - size-weighted average across ARC, GPQA, MMLU, and MMLU Pro
- Overall: Combined average across all tasks
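
For clarity, the sketch below shows how a size-weighted average is computed. The benchmark sizes used here (roughly 378 for MBPP Plus, 257 for MBPP Sanitized, and 164 each for HumanEval and HumanEval Plus) are approximations inferred from the reported standard errors, not official counts.

```python
def size_weighted_avg(scores, sizes):
    """Average benchmark scores, weighting each benchmark by its number of problems."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# C-Avg for ControlLLM-Hybrid over MB+, MS, HE+, HE with approximate benchmark sizes.
# Prints 77.0 with these rounded inputs; the table's 77.1 comes from unrounded scores.
print(round(size_weighted_avg([77.5, 73.5, 76.2, 82.3], [378, 257, 164, 164]), 1))
```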