Model Card for Llama-3.1-Argunaut-1-8B-SFT
This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct. It has been trained using TRL.
Quick start
from transformers import pipeline
question = "Are you familiar with Argdown syntax? What's its purpose?"
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
Evals
⚠️ NOTE: These self-reported results were obtained with lm-eval-harness using the local-completions API; they deviate significantly from the official Open LLM Leaderboard evals, which are also reported at the end of this README.
LM Eval Harness results (local completions/vllm): wandb report
Model | BBH | MATH | GPQA | MMLU Pro |
---|---|---|---|---|
Llama-3.1-Argunaut-1-8B-SFT | 44.6% | 9.0% | 32.1% | 34.5% |
SFT dataset mixture
Dataset | Weight (examples) | Weight (tokens) |
---|---|---|
DebateLabKIT/deepa2-conversations | 25% | 49% |
DebateLabKIT/deep-argmap-conversations | 25% | 18% |
allenai/tulu-3-sft-mixture | 50% | 33% |
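The gap between the two weight columns reflects differences in average example length. A minimal sketch of that arithmetic follows, using only the percentages in the table. The helper name `relative_length` is introduced here for illustration and is not part of the training code. If a dataset holds share e of the examples and share t of the tokens, its average example is t / e times as long as the mixture-wide average.

```python
# Mixture shares copied from the table above: (percent of examples, percent of tokens).
mixture = {
    "DebateLabKIT/deepa2-conversations": (25, 49),
    "DebateLabKIT/deep-argmap-conversations": (25, 18),
    "allenai/tulu-3-sft-mixture": (50, 33),
}


def relative_length(example_share, token_share):
    """Average tokens per example, relative to the mixture-wide average."""
    return token_share / example_share


relative_lengths = {
    name: relative_length(examples, tokens)
    for name, (examples, tokens) in mixture.items()
}

for name, ratio in relative_lengths.items():
    print(f"{name}: {ratio:.2f}x the mixture-wide average example length")
```

Under this reading, deepa2-conversations examples are roughly twice as long as the average example in the mixture, while the other two datasets contribute shorter examples.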
Training procedure
Trained with SFT on 1M examples for 1 epoch, with:
- context length 8196
- packing (trl implementation)
- spectrum (top 30 percent)
# Training parameters
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-6 # following _Tülu 3_ recipe
lr_scheduler_type: cosine
warmup_ratio: 0.1
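To illustrate what `lr_scheduler_type: cosine` with `warmup_ratio: 0.1` implies, here is a rough pure-Python sketch of the learning-rate curve. It assumes linear warmup from zero followed by a half-cosine decay to zero, which is the shape commonly associated with this setting. The function `lr_at` and the value of `TOTAL_STEPS` are hypothetical. The actual number of optimizer steps is not stated in this card.

```python
import math

PEAK_LR = 5.0e-6    # learning_rate from the config above
WARMUP_RATIO = 0.1  # warmup_ratio from the config above


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical value, chosen only for illustration
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The curve rises over the first tenth of training, peaks at 5.0e-6, and then decays smoothly towards zero by the final step.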
Hardware: 2 x H100 GPUs.
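Combining the batch settings with the GPU count gives the effective batch size per optimizer step. The sketch below assumes plain data parallelism across both GPUs and that packing fills each sequence up to the stated context length. Neither assumption is confirmed by this card, so the token figure is an upper bound.

```python
PER_DEVICE_TRAIN_BATCH_SIZE = 8  # from the training parameters
GRADIENT_ACCUMULATION_STEPS = 2  # from the training parameters
NUM_GPUS = 2                     # from the hardware note; assumes data parallelism
CONTEXT_LENGTH = 8196            # as stated in the training procedure

# Packed sequences consumed per optimizer step across all devices.
sequences_per_step = (
    PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS
)

# Upper bound on tokens per optimizer step if every packed sequence is full.
max_tokens_per_step = sequences_per_step * CONTEXT_LENGTH

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"max tokens per optimizer step: {max_tokens_per_step}")
```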
This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research.
Framework versions
- TRL: 0.12.1
- Transformers: 4.46.3
- PyTorch: 2.4.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
Credits
This work wouldn't be possible without all the great contributions from the open LLM community. Thank you! Special kudos go to
- @philschmid for his latest fine-tuning boilerplate
- @lvwerra, @lewtun et al. for building and maintaining trl
- @cognitivecomputations for sharing spectrum
Open LLM Leaderboard Evaluation Results
Detailed results can be found here! Summarized results can be found here!
Metric | Value (%) |
---|---|
Average | 23.56 |
IFEval (0-Shot) | 55.19 |
BBH (3-Shot) | 27.19 |
MATH Lvl 5 (4-Shot) | 11.18 |
GPQA (0-shot) | 4.47 |
MuSR (0-shot) | 15.85 |
MMLU-PRO (5-shot) | 27.47 |
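When comparing these values with the self-reported results above, note that the two tables may not be on the same scale. The sketch below assumes that the leaderboard reports scores normalized between a random-guessing baseline (0) and a perfect score (100), and that the baselines are 1/4 for GPQA and 1/10 for MMLU-Pro. Both are assumptions made here and are not stated in this card. The helper `denormalize` is hypothetical.

```python
def denormalize(normalized_pct, random_baseline):
    """Map a baseline-normalized score (in percent) back to a raw accuracy in [0, 1].

    Assumes normalized = (raw - baseline) / (1 - baseline) * 100.
    """
    return random_baseline + (normalized_pct / 100.0) * (1.0 - random_baseline)


# Leaderboard values from the table above; baselines are assumptions.
gpqa_raw = denormalize(4.47, 0.25)       # assumes 4 answer options
mmlu_pro_raw = denormalize(27.47, 0.10)  # assumes 10 answer options

print(f"GPQA raw accuracy (estimated): {gpqa_raw:.1%}")
print(f"MMLU-Pro raw accuracy (estimated): {mmlu_pro_raw:.1%}")
```

If these assumptions hold, part of the gap between the two tables comes from normalization rather than from differences in model behaviour. The remaining difference would still reflect the different evaluation setups.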