pythia-6.9b-HC3 / README.md
metadata
license: apache-2.0
tags:
  - generated_from_trainer
  - HC3
  - chatGPT
  - assistant
datasets:
  - pszemraj/HC3-textgen-qa
metrics:
  - accuracy
inference: false
base_model: EleutherAI/pythia-6.9b-deduped

pythia-6.9b-deduped for general QA


This model is a fine-tuned version of EleutherAI/pythia-6.9b-deduped on the pszemraj/HC3-textgen-qa dataset. It achieves the following results on the evaluation set:

  • Loss: 1.2372
  • Accuracy: 0.6769
  • Perplexity: 3.446
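
The perplexity above is consistent with the exponential of the evaluation loss. A minimal sketch of that relationship, assuming the loss is the mean per-token cross-entropy in nats (the usual convention for causal language modeling, though not stated explicitly here):

```python
import math

# Evaluation loss reported above; assumed to be mean cross-entropy per token (nats)
eval_loss = 1.2372

# Perplexity is the exponential of the mean cross-entropy
perplexity = math.exp(eval_loss)

print(f"perplexity = {perplexity:.3f}")
```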

Model description

Text generation model trained on the HC3 text data of human questions + ChatGPT answers.


Usage

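Install the packages needed for 8-bit inference (bitsandbytes and accelerate are only required if your GPU does not have enough memory to load the model without quantization):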

pip install -U -q transformers bitsandbytes accelerate

Basic inference example:

import pprint as pp

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/pythia-6.9b-HC3")

model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/pythia-6.9b-HC3", load_in_8bit=True, device_map="auto"
)  # shards are ~4GB each, there are eight total

# question followed by the "<answer>" tag
prompt = "I was wondering how much wood a woodchuck could chuck? <answer>"
# move the inputs to whichever device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=300
)  # default generation config (+ 300 tokens)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# keep only the text before the end-of-answer marker
result = result.split("<end_answer>")[0].strip()

pp.pprint(result)
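
The prompt convention in the example above (a question followed by `<answer>`, with the output cut at `<end_answer>`) can be wrapped in two small helpers. This is only a sketch: `build_prompt` and `extract_answer` are hypothetical names, not part of the model or of transformers, and the post-processing simply mirrors the split used in the example.

```python
ANSWER_TAG = "<answer>"
END_TAG = "<end_answer>"


def build_prompt(question: str) -> str:
    """Append the answer tag to a question, as in the inference example."""
    return f"{question.strip()} {ANSWER_TAG}"


def extract_answer(decoded: str) -> str:
    """Keep only the text before the end-of-answer marker, if present."""
    return decoded.split(END_TAG)[0].strip()


prompt = build_prompt("I was wondering how much wood a woodchuck could chuck?")
print(prompt)
```

`extract_answer` would be applied to the string returned by `tokenizer.batch_decode(...)[0]`; like the original example, it leaves the prompt text in the result.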

The default GenerationConfig uses contrastive search with top_k=4 and penalty_alpha=0.6. For more information on inference and parameters to use, see the transformers docs.

Intended uses & limitations

  • Intended use: research/exploration comparing RLHF tuning with "guided"/specific tuning on "quality" datasets, i.e. responses that approximate "what the human would want as an answer anyway".
  • This model is not trained/fine-tuned with RLHF, so it will not be as helpful, generalizable, or safe as ChatGPT (aside from the fact that this model is ~30x smaller).

Training and evaluation data

model-index:
- name: pythia-6.9b-hc3-qa-assistant
  results:
  - task:
      name: Causal Language Modeling
      type: text-generation
    dataset:
      name: pszemraj/HC3-textgen-qa
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.6768941789814655

Training procedure

Two epochs on the pszemraj/HC3-textgen-qa dataset.

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.2598        | 0.99  | 79   | 1.3291          | 0.6496   |
| 0.7446        | 1.99  | 158  | 1.2372          | 0.6769   |

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|-------|
| Avg.                | 33.33 |
| ARC (25-shot)       | 36.52 |
| HellaSwag (10-shot) | 61.76 |
| MMLU (5-shot)       | 26.94 |
| TruthfulQA (0-shot) | 45.05 |
| Winogrande (5-shot) | 60.77 |
| GSM8K (5-shot)      | 0.0   |
| DROP (3-shot)       | 2.23  |
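
As a sanity check, the average can be recomputed from the seven task scores listed above. This sketch assumes "Avg." is the unweighted mean of those tasks; recomputing from the rounded per-task values gives a result within about 0.01 of the reported 33.33, which is consistent with rounding.

```python
# Task scores copied from the table above
scores = {
    "ARC (25-shot)": 36.52,
    "HellaSwag (10-shot)": 61.76,
    "MMLU (5-shot)": 26.94,
    "TruthfulQA (0-shot)": 45.05,
    "Winogrande (5-shot)": 60.77,
    "GSM8K (5-shot)": 0.0,
    "DROP (3-shot)": 2.23,
}

# Unweighted mean over all tasks (assumed definition of "Avg.")
average = sum(scores.values()) / len(scores)

print(f"average = {average:.2f}")
```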