|
--- |
|
license: llama3.1 |
|
datasets: |
|
- DebateLabKIT/deepa2-conversations |
|
- DebateLabKIT/deep-argmap-conversations |
|
- allenai/tulu-3-sft-mixture |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
tags: |
|
- logic |
|
- argumentation |
|
- critical-thinking |
|
- argument-mapping |
|
- trl |
|
- sft |
|
model-index:
- name: Llama-3.1-Argunaut-1-8B-SFT
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 55.19
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 27.19
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 11.18
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.47
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 15.85
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 27.47
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
|
--- |
|
|
|
|
|
# Model Card for Llama-3.1-Argunaut-1-8B-SFT |
|
|
|
This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), specialized for logical analysis, argumentation, and argument mapping.

It has been trained using [TRL](https://github.com/huggingface/trl).
|
|
|
## Quick start |
|
|
|
```python
from transformers import pipeline

question = "Are you familiar with Argdown syntax? What's its purpose?"

# Chat-style pipeline: pass the conversation as a list of {"role", "content"} messages
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
|
|
|
## Evals |
|
|
|
**⚠️ NOTE**: The following self-reported results were obtained with lm-evaluation-harness via the local-completions API; they deviate significantly from the official [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) evals, which are reported at the end of this README.
|
|
|
LM Eval Harness results (local-completions / vLLM): [wandb report](https://api.wandb.ai/links/ggbetz/3bwr0ou6)
|
|
|
|Model|BBH|MATH|GPQA|MMLU Pro| |
|
|:--------|:---:|:---:|:---:|:---:| |
|
| Llama-3.1-Argunaut-1-8B-SFT | 44.6% | 9.0% | 32.1% | 34.5% | |
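
A minimal sketch of how such a run can be reproduced through the harness's Python API against a local OpenAI-compatible completions server (e.g. vLLM); the `base_url` and task names are illustrative assumptions and may vary across harness versions:

```python
import lm_eval

# Evaluate via an OpenAI-compatible completions endpoint (e.g. a local vLLM server).
# base_url and task names are illustrative; adjust for your setup and harness version.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT,"
        "base_url=http://localhost:8000/v1/completions"
    ),
    tasks=["bbh", "gpqa", "mmlu_pro"],
)
print(results["results"])
```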
|
|
|
|
|
## SFT dataset mixture |
|
|
|
|Dataset|Weight (examples)|Weight (tokens)| |
|
|:------|:----:|:----:| |
|
|DebateLabKIT/deepa2-conversations|25%|49%| |
|
|DebateLabKIT/deep-argmap-conversations|25%|18%| |
|
|allenai/tulu-3-sft-mixture|50%|33%| |
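
As a minimal sketch (not the exact training recipe), a mixture with these example weights could be assembled with `datasets.interleave_datasets`; the split names, seed, and stopping strategy below are assumptions:

```python
from datasets import interleave_datasets, load_dataset

# Sample the three corpora according to the example weights from the table above.
mixture = interleave_datasets(
    [
        load_dataset("DebateLabKIT/deepa2-conversations", split="train"),
        load_dataset("DebateLabKIT/deep-argmap-conversations", split="train"),
        load_dataset("allenai/tulu-3-sft-mixture", split="train"),
    ],
    probabilities=[0.25, 0.25, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```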
|
|
|
|
|
## Training procedure |
|
|
|
Trained with SFT for 1 epoch on **1M examples**, with:
|
|
|
* context length of 8196 tokens

* packing (TRL implementation)

* *Spectrum* layer selection (top 30 percent; see the sketch below)
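
[Spectrum](https://github.com/cognitivecomputations/spectrum) ranks weight matrices by signal-to-noise ratio and trains only the top fraction, freezing the rest. Here is a minimal sketch of Spectrum-style freezing, assuming a list of unfrozen parameter-name patterns of the kind the Spectrum tool emits (the patterns shown are hypothetical):

```python
import re

from transformers import AutoModelForCausalLM

# Hypothetical patterns; in practice Spectrum emits a YAML file listing
# the top-k percent of modules ranked by signal-to-noise ratio.
unfrozen_patterns = [
    r"^model\.layers\.\d+\.self_attn\.(q|k|v|o)_proj\.weight$",
    r"^model\.layers\.\d+\.mlp\.(gate|up|down)_proj\.weight$",
]

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Train only parameters that match an unfrozen pattern; freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = any(re.match(p, name) for p in unfrozen_patterns)
```

Selected training hyperparameters: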
|
|
|
```yaml |
|
# Training parameters |
|
num_train_epochs: 1 |
|
per_device_train_batch_size: 8 |
|
gradient_accumulation_steps: 2 |
|
gradient_checkpointing: true |
|
gradient_checkpointing_kwargs: |
|
use_reentrant: false |
|
learning_rate: 5.0e-6 # following _Tülu 3_ recipe |
|
lr_scheduler_type: cosine |
|
warmup_ratio: 0.1 |
|
``` |
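
As a rough sketch (not the verbatim training script), these parameters plug into TRL's `SFTConfig`/`SFTTrainer` as follows; the output directory and single-dataset wiring are illustrative stand-ins for the mixture described above:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="llama-3.1-argunaut-1-8b-sft",  # illustrative
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=8196,  # context length, as above
    packing=True,         # TRL's packing implementation
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=config,
    train_dataset=load_dataset("allenai/tulu-3-sft-mixture", split="train"),
)
trainer.train()
```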
|
|
|
Hardware: 2 x H100 GPUs. |
|
|
|
_This work was performed on the HoreKa supercomputer funded by the |
|
Ministry of Science, Research and the Arts Baden-Württemberg and by |
|
the Federal Ministry of Education and Research._ |
|
|
|
### Framework versions |
|
|
|
- TRL: 0.12.1 |
|
- Transformers: 4.46.3 |
|
- PyTorch: 2.4.1
|
- Datasets: 3.1.0 |
|
- Tokenizers: 0.20.3 |
|
|
|
## Credits |
|
|
|
This work wouldn't be possible without all the **great contributions from the open LLM community**. Thank you! Special kudos go to |
|
|
|
- @philschmid for his latest [fine-tuning boilerplate](https://www.philschmid.de/fine-tune-llms-in-2025) |
|
- @lvwerra, @lewtun et al. for building and maintaining [trl](https://github.com/huggingface/trl)
|
- @cognitivecomputations for sharing [spectrum](https://github.com/cognitivecomputations/spectrum/tree/main) |
|
|
|
|
|
|
|
## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/DebateLabKIT__Llama-3.1-Argunaut-1-8B-SFT-details)!

Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!
|
|
|
| Metric |Value (%)| |
|
|-------------------|--------:| |
|
|**Average** | 23.56| |
|
|IFEval (0-Shot) | 55.19| |
|
|BBH (3-Shot) | 27.19| |
|
|MATH Lvl 5 (4-Shot)| 11.18| |
|
|GPQA (0-shot) | 4.47| |
|
|MuSR (0-shot) | 15.85| |
|
|MMLU-PRO (5-shot) | 27.47| |
|
|
|
|