license: llama3.1
datasets:
- DebateLabKIT/deepa2-conversations
- DebateLabKIT/deep-argmap-conversations
- allenai/tulu-3-sft-mixture
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- logic
- argumentation
- critical-thinking
- argument-mapping
- trl
- sft
model-index:
- name: Llama-3.1-Argunaut-1-8B-SFT
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: wis-k/instruction-following-eval
split: train
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 55.19
name: averaged accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: SaylorTwift/bbh
split: test
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 27.19
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: lighteval/MATH-Hard
split: test
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 11.18
name: exact match
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
split: train
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 4.47
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 15.85
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 27.47
name: accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
name: Open LLM Leaderboard
# Model Card for Llama-3.1-Argunaut-1-8B-SFT
This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). It has been trained using [TRL](https://github.com/huggingface/trl).
## Quick start
```python
from transformers import pipeline

question = "Are you familiar with Argdown syntax? What's its purpose?"
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
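For more control over the prompt and decoding, the chat template can also be applied manually. The following is a minimal sketch, assuming a CUDA-capable GPU and bfloat16 weights; the example question is illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Map the main pro and con arguments on speed limits in Argdown."}]

# Apply the Llama 3.1 chat template and generate a completion
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```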
## Evals
⚠️ NOTE: These self-reported results were obtained with lm-eval-harness using the local-completions API; they deviate significantly from the official Open LLM Leaderboard evals, which are also reported at the end of this README.
LM Eval Harness results (local completions/vllm): wandb report
| Model | BBH | MATH | GPQA | MMLU Pro |
|---|---|---|---|---|
| Llama-3.1-Argunaut-1-8B-SFT | 44.6% | 9.0% | 32.1% | 34.5% |
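For reference, results of this kind can be reproduced with lm-eval-harness's Python API against an OpenAI-compatible completions server (e.g. vLLM). The sketch below is illustrative; the endpoint URL and the task/few-shot settings shown are assumptions, not the exact configuration used for the numbers above:

```python
import lm_eval

# Evaluate via the local-completions backend against a locally served model
# (e.g. `vllm serve DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT`). The URL is an assumption.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT,"
        "base_url=http://localhost:8000/v1/completions"
    ),
    tasks=["bbh"],
    num_fewshot=3,
)
print(results["results"])
```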
## SFT dataset mixture
| Dataset | Weight (examples) | Weight (tokens) |
|---|---|---|
| DebateLabKIT/deepa2-conversations | 25% | 49% |
| DebateLabKIT/deep-argmap-conversations | 25% | 18% |
| allenai/tulu-3-sft-mixture | 50% | 33% |
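A mixture with these example-level proportions could, for instance, be assembled with `datasets.interleave_datasets`. This is a minimal sketch under assumptions; the split names, seed, and stopping strategy are illustrative and not taken from the actual training setup:

```python
from datasets import load_dataset, interleave_datasets

# Load the three source datasets (assuming "train" splits)
deepa2 = load_dataset("DebateLabKIT/deepa2-conversations", split="train")
argmap = load_dataset("DebateLabKIT/deep-argmap-conversations", split="train")
tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# Sample examples according to the 25% / 25% / 50% mixture weights
mixture = interleave_datasets(
    [deepa2, argmap, tulu],
    probabilities=[0.25, 0.25, 0.5],
    seed=42,  # assumption: any fixed seed
    stopping_strategy="all_exhausted",
)
```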
## Training procedure
Trained with SFT on 1M examples for 1 epoch, with:
- context length 8196
- packing (trl implementation)
- spectrum (top 30 percent)
```yaml
# Training parameters
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-6  # following _Tülu 3_ recipe
lr_scheduler_type: cosine
warmup_ratio: 0.1
```
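In TRL, these hyperparameters map onto `SFTConfig`/`SFTTrainer` roughly as sketched below. This is not the actual training script: the output directory and precision flag are assumptions, `mixture` stands for the dataset mix described above, and the spectrum-based freezing of the top 30 percent of layers is applied separately and not shown.

```python
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="llama-3.1-argunaut-1-8b-sft",  # assumption
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=8196,  # context length, as reported above
    packing=True,         # TRL's packing implementation
    bf16=True,            # assumption
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=training_args,
    train_dataset=mixture,
)
trainer.train()
```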
Hardware: 2 x H100 GPUs.
This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research.
## Framework versions
- TRL: 0.12.1
- Transformers: 4.46.3
- PyTorch: 2.4.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
## Credits
This work wouldn't be possible without all the great contributions from the open LLM community. Thank you! Special kudos go to
- @philschmid for his latest fine-tuning boilerplate
- @lvwerra, @lewtun et al for building and maintaining trl
- @cognitivecomputations for sharing spectrum
## Open LLM Leaderboard Evaluation Results
Detailed results can be found here! Summarized results can be found here!
| Metric | Value (%) |
|---|---|
| Average | 23.56 |
| IFEval (0-Shot) | 55.19 |
| BBH (3-Shot) | 27.19 |
| MATH Lvl 5 (4-Shot) | 11.18 |
| GPQA (0-shot) | 4.47 |
| MuSR (0-shot) | 15.85 |
| MMLU-PRO (5-shot) | 27.47 |