---
license: llama3.1
datasets:
  - OpenCoder-LLM/opc-sft-stage1
  - OpenCoder-LLM/opc-sft-stage2
language:
  - en
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
model-index:
  - name: Control-LLM-Llama3.1-8B-OpenCoder8
    results:
      - task:
          type: code-evaluation
        dataset:
          type: mixed
          name: Code Evaluation Dataset
        metrics:
          - name: pass_at_1,n=1 (code_instruct)
            type: pass_at_1
            value: 0.770508826583593
            stderr: 0.013547264970313243
            verified: false
          - name: pass_at_1,n=1 (humaneval_greedy_instruct)
            type: pass_at_1
            value: 0.823170731707317
            stderr: 0.029883277857485988
            verified: false
          - name: pass_at_1,n=1 (humaneval_plus_greedy_instruct)
            type: pass_at_1
            value: 0.7621951219512195
            stderr: 0.033346454086653404
            verified: false
          - name: pass_at_1,n=1 (mbpp_plus_0shot_instruct)
            type: pass_at_1
            value: 0.7751322751322751
            stderr: 0.02150209607822914
            verified: false
          - name: pass_at_1,n=1 (mbpp_sanitized_0shot_instruct)
            type: pass_at_1
            value: 0.7354085603112841
            stderr: 0.027569713464529938
            verified: false
      - task:
          type: original-capability
        dataset:
          type: meta/Llama-3.1-8B-Instruct-evals
          name: Llama-3.1-8B-Instruct-evals Dataset
          dataset_path: meta-llama/Llama-3.1-8B-Instruct-evals
          dataset_name: Llama-3.1-8B-Instruct-evals__arc_challenge__details
        metrics:
          - name: exact_match,strict-match (original_capability_instruct)
            type: exact_match
            value: 0.5599378769819771
            stderr: 0.0028491774433443513
            verified: false
          - name: exact_match,strict-match (meta_arc_0shot_instruct)
            type: exact_match
            value: 0.8094420600858369
            stderr: 0.011511446994122106
            verified: false
          - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
            type: exact_match
            value: 0.32589285714285715
            stderr: 0.02216910313464341
            verified: false
          - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
            type: exact_match
            value: 0.681241988320752
            stderr: 0.003932622311434926
            verified: false
          - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
            type: exact_match
            value: 0.4029255319148936
            stderr: 0.004471732136513382
            verified: false
pipeline_tag: text-generation
library_name: transformers
---

# Control-LLM-Llama3.1-8B-OpenCoder8

This is a fine-tuned version of Llama-3.1-8B-Instruct for coding tasks, trained on the OpenCoder SFT datasets and described in the paper *Control LLM: Controlled Evolution for Intelligence Retention in LLM*.

Code: https://github.com/linkedin/ControlLLM.

## Linked open-source code: training, evaluation, and benchmarks

This model is associated with the GitHub repository Control-LLM.
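
The model can be loaded with the transformers library. Below is a minimal usage sketch; the repository id and prompt are illustrative assumptions, not values taken from this card.

```python
# Minimal usage sketch. The model id below is an assumption based on this card's
# title; adjust it to the actual repository path if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-OpenCoder8"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Chat-style prompt, since the base model is Llama-3.1-8B-Instruct.
messages = [
    {"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```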

## Evaluation Results

Here is an overview of the evaluation results and findings:

### Hybrid Expansion on OpenCoder

The following diagram illustrates how hybrid expansion works.

### Catastrophic Forgetting

### Benchmark Results Table

The table below summarizes evaluation results across coding tasks and original capabilities.

| Model | MB+ | MS | HE+ | HE | C-Avg | ARC | GP | MLU | MLUP | O-Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Ins | 70.4 | 67.7 | 66.5 | 70.7 | 69.1 | 83.4 | 29.9 | 72.4 | 46.7 | 60.5 | 64.8 |
| OpenCoder-8B-Ins | 81.2 | 76.3 | 78.0 | 82.3 | 79.5 | 8.2 | 25.4 | 37.4 | 11.3 | 24.6 | 52.1 |
| Full Param Tune | 75.1 | 69.6 | 71.3 | 76.8 | 73.3 | 24.4 | 21.9 | 43.0 | 19.2 | 31.5 | 52.4 |
| Partial Param Tune | 75.7 | 71.6 | 74.4 | 79.3 | 75.0 | 70.2 | 28.1 | 60.7 | 32.4 | 48.3 | 61.7 |
| Stack Expansion | 77.2 | 72.8 | 73.2 | 78.7 | 75.6 | 80.0 | 26.3 | 66.6 | 38.2 | 54.2 | 64.9 |
| ControlLLM-Hybrid | 77.5 | 73.5 | 76.2 | 82.3 | 77.1 | 80.9 | 32.6 | 68.1 | 40.3 | 56.0 | 66.6 |

Explanation of column abbreviations:

- MB+: MBPP Plus
- MS: MBPP Sanitized
- HE+: HumanEval Plus
- HE: HumanEval
- C-Avg: Coding - size-weighted average across MB+, MS, HE+, and HE (see the sketch after this list)
- ARC: ARC benchmark
- GP: GPQA benchmark
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU Pro
- O-Avg: Original capability - size-weighted average across ARC, GPQA, MMLU, and MMLU Pro
- Overall: Combined average across all tasks
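
For clarity, the sketch below shows how a size-weighted average is computed. The benchmark sizes used here (roughly 378 for MBPP Plus, 257 for MBPP Sanitized, and 164 each for HumanEval and HumanEval Plus) are approximations inferred from the reported standard errors, not official counts.

```python
def size_weighted_avg(scores, sizes):
    """Average benchmark scores, weighting each benchmark by its number of problems."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# C-Avg for ControlLLM-Hybrid over MB+, MS, HE+, HE with approximate benchmark sizes.
# Prints 77.0 with these rounded inputs; the table's 77.1 comes from unrounded scores.
print(round(size_weighted_avg([77.5, 73.5, 76.2, 82.3], [378, 257, 164, 164]), 1))
```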