---
license: llama3.1
datasets:
- DebateLabKIT/deepa2-conversations
- DebateLabKIT/deep-argmap-conversations
- allenai/tulu-3-sft-mixture
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- logic
- argumentation
- critical-thinking
- argument-mapping
- trl
- sft
model-index:
- name: Llama-3.1-Argunaut-1-8B-SFT
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 55.19
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 27.19
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 11.18
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.47
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 15.85
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 27.47
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
---
# Model Card for Llama-3.1-Argunaut-1-8B-SFT
This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
It has been trained using [TRL](https://github.com/huggingface/trl).
## Quick start
```python
from transformers import pipeline

question = "Are you familiar with Argdown syntax? What's its purpose?"
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
# Passing a list of chat messages makes the pipeline apply the model's chat template
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
## Evals
LM Eval Harness results (run locally with the completions/vLLM backends): [wandb report](https://api.wandb.ai/links/ggbetz/3bwr0ou6)
Comparing `Llama-3.1-Argunaut-1-8B-SFT` against top-performing Llama-8B models from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/):
|Model|BBH|MATH|GPQA|MMLU Pro|
|:--------|:---:|:---:|:---:|:---:|
| **Llama-3.1-Argunaut-1-8B-SFT** | 44.6% | 9.0% | 32.1% | 34.5% |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 29.9% | 19.3% | 2.6% | 30.7% |
| arcee-ai/Llama-3.1-SuperNova-Lite | 31.6% | 17.4% | 7.5% | 32.0% |
| allenai/Llama-3.1-Tulu-3-8B-SFT | 13.9% | 11.4% | 3.7% | 20.1% |
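The numbers above can in principle be reproduced with LM Eval Harness's Python API. The sketch below is illustrative only: the task names and settings are assumptions and may need adjusting to the harness version; see the wandb report for the exact configuration used.
```python
import lm_eval

# Hypothetical local re-run via the vLLM backend; task names are illustrative.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT,dtype=bfloat16",
    tasks=["bbh", "gpqa", "mmlu_pro"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```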
## SFT dataset mixture
|Dataset|Weight (examples)|Weight (tokens)|
|:------|:----:|:----:|
|DebateLabKIT/deepa2-conversations|25%|49%|
|DebateLabKIT/deep-argmap-conversations|25%|18%|
|allenai/tulu-3-sft-mixture|50%|33%|
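For illustration, an example-level mixture with these weights could be built with `datasets.interleave_datasets`. This is a minimal sketch, not the actual data pipeline; the `train` splits and the sampling seed are assumptions.
```python
from datasets import interleave_datasets, load_dataset

# Load the three source datasets (splits assumed to be "train").
sources = [
    load_dataset("DebateLabKIT/deepa2-conversations", split="train"),
    load_dataset("DebateLabKIT/deep-argmap-conversations", split="train"),
    load_dataset("allenai/tulu-3-sft-mixture", split="train"),
]

# Example-level weights from the table above: 25% / 25% / 50%.
mixture = interleave_datasets(
    sources,
    probabilities=[0.25, 0.25, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```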
## Training procedure
Trained with SFT on **1M examples** for 1 epoch with
* context length 8196
* packing (TRL implementation)
* *Spectrum* selective fine-tuning (top 30 percent of layers unfrozen)
```yaml
# Training parameters
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
learning_rate: 5.0e-6 # following _Tülu 3_ recipe
lr_scheduler_type: cosine
warmup_ratio: 0.1
```
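A minimal TRL sketch consistent with the parameters above, assuming TRL 0.12 as listed under framework versions. This is not the actual training script: the Spectrum YAML filename is hypothetical, `mixture` refers to the dataset sketch above, and preprocessing is omitted.
```python
import re
import torch
import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Spectrum-style selective fine-tuning: freeze everything except the
# parameters listed in the YAML produced by the spectrum tool
# (filename here is hypothetical).
with open("snr_results_top30percent.yaml") as f:
    unfrozen = yaml.safe_load(f)["unfrozen_parameters"]
for name, param in model.named_parameters():
    param.requires_grad = any(re.match(p, name) for p in unfrozen)

training_args = SFTConfig(
    output_dir="Llama-3.1-Argunaut-1-8B-SFT",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=8196,  # context length from the list above
    packing=True,         # TRL's example packing
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=mixture,  # e.g. the interleaved mixture sketched above
    processing_class=tokenizer,
)
trainer.train()
```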
Hardware: 2 x H100 GPUs.
_This work was performed on the HoreKa supercomputer funded by the
Ministry of Science, Research and the Arts Baden-Württemberg and by
the Federal Ministry of Education and Research._
### Framework versions
- TRL: 0.12.1
- Transformers: 4.46.3
- Pytorch: 2.4.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
## Credits
This work wouldn't be possible without all the **great contributions from the open LLM community**. Thank you! Special kudos go to
- @philschmid for his latest [fine-tuning boilerplate](https://www.philschmid.de/fine-tune-llms-in-2025)
- @lvwerra, @lewtun et al for building and maintaining [trl](https://github.com/huggingface/trl)
- @cognitivecomputations for sharing [spectrum](https://github.com/cognitivecomputations/spectrum/tree/main)
## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/DebateLabKIT__Llama-3.1-Argunaut-1-8B-SFT-details) and summarized results [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc).
| Metric |Value (%)|
|-------------------|--------:|
|**Average** | 23.56|
|IFEval (0-Shot) | 55.19|
|BBH (3-Shot) | 27.19|
|MATH Lvl 5 (4-Shot)| 11.18|
|GPQA (0-shot) | 4.47|
|MuSR (0-shot) | 15.85|
|MMLU-PRO (5-shot) | 27.47|