---
license: mit
library_name: transformers
datasets:
  - teknium/OpenHermes-2.5
model-index:
  - name: phi-2-OpenHermes-2.5
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 59.81
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 74.85
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 55.51
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 43.86
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 75.06
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 41.17
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5
          name: Open LLM Leaderboard
---

# phi-2-OpenHermes-2.5

microsoft/phi-2, fine-tuned on teknium/OpenHermes-2.5

## Training

- QLoRA, rank 32, learning rate 2e-5, 1 epoch
- effective batch size: 200
- maximum sequence length: 1024 tokens
- training code in `code/`
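The hyperparameters above can be sketched as a QLoRA setup with `peft` and `transformers`. The rank (32), learning rate (2e-5), epoch count (1), sequence length (1024), and effective batch size (200) come from this card; the target-module choice, alpha, dropout, and the per-device/accumulation split are assumptions for illustration, not the exact training config (see `code/` for that):

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantized base model -- the "Q" in QLoRA
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=32,                           # rank from the card
    lora_alpha=32,                  # assumption
    lora_dropout=0.05,              # assumption
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,             # from the card
    num_train_epochs=1,             # from the card
    per_device_train_batch_size=8,  # assumption: 8 x 25 = effective batch 200
    gradient_accumulation_steps=25,
    bf16=True,
)
```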

## Evals

| Model                          | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|--------------------------------|---------|---------|------------|----------|---------|
| g-ronimo/phi-2-OpenHermes-2.5  | 30.27   | 71.18   | 43.87      | 35.9     | 45.3    |
| minghaowu/phi-2-OpenHermes-2.5 | 27.95   | 67.55   | 48.07      | 36.17    | 44.94   |
| phi-2                          | 27.96   | 70.84   | 44.46      | 35.17    | 44.61   |

## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

modelpath = "g-ronimo/phi-2-OpenHermes-2.5"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # optional, requires flash-attn
)
tokenizer = AutoTokenizer.from_pretrained(modelpath)

messages = [
    {"role": "system", "content": "answer like a pirate"},
    {"role": "user", "content": "what does it mean to be successful?"},
]

# Render the messages with the model's chat template and move them to the GPU
input_tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

output_tokens = model.generate(input_tokens, max_new_tokens=500)
output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print(output)
```

Ahoy there, matey! To me, being successful means having the wind in your sails and reaching the treasure you've been dreaming of. It's about setting sail on a journey with clear goals, working hard, facing challenges head-on, and never losing sight of what truly matters. So, set your compass right, hoist your Jolly Roger high, and let's embark on this adventure together! ⚓️💰⛵️
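OpenHermes-2.5 models are commonly trained on the ChatML format, so the `apply_chat_template` call above presumably renders something like the sketch below. This is an assumption for illustration; the authoritative template ships with the tokenizer, and the `chatml` helper here is hypothetical:

```python
def chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering of a message list (illustrative only)."""
    s = ""
    for m in messages:
        s += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        s += "<|im_start|>assistant\n"  # cue the model to answer next
    return s

messages = [
    {"role": "system", "content": "answer like a pirate"},
    {"role": "user", "content": "what does it mean to be successful?"},
]
print(chatml(messages))
```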

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=g-ronimo/phi-2-OpenHermes-2.5).

| Metric                            | Value |
|-----------------------------------|-------|
| Avg.                              | 58.38 |
| AI2 Reasoning Challenge (25-Shot) | 59.81 |
| HellaSwag (10-Shot)               | 74.85 |
| MMLU (5-Shot)                     | 55.51 |
| TruthfulQA (0-shot)               | 43.86 |
| Winogrande (5-shot)               | 75.06 |
| GSM8k (5-shot)                    | 41.17 |
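The Avg. row is the arithmetic mean of the six benchmark scores; a quick sanity check, with the values taken from the table above:

```python
# Leaderboard scores from the table above
scores = {
    "ARC (25-shot)": 59.81,
    "HellaSwag (10-shot)": 74.85,
    "MMLU (5-shot)": 55.51,
    "TruthfulQA (0-shot)": 43.86,
    "Winogrande (5-shot)": 75.06,
    "GSM8k (5-shot)": 41.17,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 58.38, matching the Avg. row
```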