Model Card for Llama-3.1-Argunaut-1-8B-SFT

This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct. It has been trained using TRL.

Quick start

from transformers import pipeline

question = "Are you familiar with Argdown syntax? What's its purpose?"
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

Evals

⚠️ NOTE: These self-reported results were obtained with lm-eval-harness via the local-completions API; they deviate significantly from the official Open LLM Leaderboard evals, which are also reported at the end of this README.

LM Eval Harness results (local completions/vllm): wandb report

| Model | BBH | MATH | GPQA | MMLU Pro |
|---|---|---|---|---|
| Llama-3.1-Argunaut-1-8B-SFT | 44.6% | 9.0% | 32.1% | 34.5% |
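
For reference, an evaluation of this kind can be run by pointing lm-eval-harness at an OpenAI-compatible completions endpoint. The following is a minimal sketch, assuming a locally served model at http://localhost:8000/v1/completions and the Open LLM Leaderboard task names; it is not the exact configuration used for the numbers above.

from lm_eval import simple_evaluate

# Evaluate against a locally hosted OpenAI-compatible completions API
# (base_url, concurrency, and task selection are assumptions)
results = simple_evaluate(
    model="local-completions",
    model_args=(
        "model=DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=8"
    ),
    tasks=["leaderboard_bbh", "leaderboard_math_hard", "leaderboard_gpqa", "leaderboard_mmlu_pro"],
)
print(results["results"])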

SFT dataset mixture

| Dataset | Weight (examples) | Weight (tokens) |
|---|---|---|
| DebateLabKIT/deepa2-conversations | 25% | 49% |
| DebateLabKIT/deep-argmap-conversations | 25% | 18% |
| allenai/tulu-3-sft-mixture | 50% | 33% |
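
As a rough illustration, a mixture with these per-example weights could be assembled with the datasets library as sketched below; the split names, seed, and stopping strategy are assumptions, not the exact preprocessing used.

from datasets import load_dataset, interleave_datasets

# Load the three source datasets (the "train" split is an assumption)
deepa2 = load_dataset("DebateLabKIT/deepa2-conversations", split="train")
argmap = load_dataset("DebateLabKIT/deep-argmap-conversations", split="train")
tulu3 = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# Sample examples according to the 25/25/50 per-example weights in the table above
mixture = interleave_datasets(
    [deepa2, argmap, tulu3],
    probabilities=[0.25, 0.25, 0.50],
    seed=42,
    stopping_strategy="all_exhausted",
)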

Training procedure

Trained with SFT for 1 epoch on 1M examples, with

  • context length 8196
  • packing (TRL implementation)
  • Spectrum (top 30 percent)

The key hyperparameters are listed below, followed by a sketch of the corresponding TRL setup.

# Training parameters
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-6  # following _Tülu 3_ recipe
lr_scheduler_type: cosine
warmup_ratio: 0.1
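
Putting these pieces together, a minimal TRL training sketch might look as follows; the dataset stand-in, output directory, and omitted Spectrum layer freezing are simplifications, not the exact training script.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Stand-in for the mixed SFT dataset described above (assumption)
train_dataset = load_dataset("DebateLabKIT/deepa2-conversations", split="train")

config = SFTConfig(
    output_dir="Llama-3.1-Argunaut-1-8B-SFT",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=8196,  # context length listed above
    packing=True,         # TRL's sequence packing
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=config,
    train_dataset=train_dataset,
)
trainer.train()

Spectrum-style selective training (updating only the top 30 percent of layers) is not shown here; in practice it amounts to setting requires_grad=False on the remaining parameters before training.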

Hardware: 2 x H100 GPUs.

This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research.

Framework versions

  • TRL: 0.12.1
  • Transformers: 4.46.3
  • PyTorch: 2.4.1
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Credits

This work wouldn't be possible without all the great contributions from the open LLM community. Thank you! Special kudos go to

Open LLM Leaderboard Evaluation Results

Detailed results can be found here! Summarized results can be found here!

| Metric | Value (%) |
|---|---|
| Average | 23.56 |
| IFEval (0-shot) | 55.19 |
| BBH (3-shot) | 27.19 |
| MATH Lvl 5 (4-shot) | 11.18 |
| GPQA (0-shot) | 4.47 |
| MuSR (0-shot) | 15.85 |
| MMLU-PRO (5-shot) | 27.47 |