---
license: llama3.1
datasets:
- DebateLabKIT/deepa2-conversations
- DebateLabKIT/deep-argmap-conversations
- allenai/tulu-3-sft-mixture
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- logic
- argumentation
- critical-thinking
- argument-mapping
- trl
- sft
model-index:
- name: Llama-3.1-Argunaut-1-8B-SFT
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 55.19
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 27.19
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 11.18
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.47
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 15.85
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 27.47
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
---
# Model Card for Llama-3.1-Argunaut-1-8B-SFT
This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
It has been trained using [TRL](https://github.com/huggingface/trl).
## Quick start
```python
from transformers import pipeline

question = "Are you familiar with Argdown syntax? What's its purpose?"
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
# Passing a list of chat messages makes the pipeline apply the model's chat template
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
## Evals
LM Eval Harness results (run locally with the completions/vLLM backends): [wandb report](https://api.wandb.ai/links/ggbetz/3bwr0ou6)
Comparing `Llama-3.1-Argunaut-1-8B-SFT` against top-performing Llama-8B models from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/):
|Model|BBH|MATH|GPQA|MMLU Pro|
|:--------|:---:|:---:|:---:|:---:|
| **Llama-3.1-Argunaut-1-8B-SFT** | 44.6% | 9.0% | 32.1% | 34.5% |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 29.9% | 19.3% | 2.6% | 30.7% |
| arcee-ai/Llama-3.1-SuperNova-Lite | 31.6% | 17.4% | 7.5% | 32.0% |
| allenai/Llama-3.1-Tulu-3-8B-SFT | 13.9% | 11.4% | 3.7% | 20.1% |
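The numbers above can in principle be reproduced with LM Eval Harness's Python API. The sketch below is illustrative only: the task names and settings are assumptions and may need adjusting to the harness version; see the wandb report for the exact configuration used.
```python
import lm_eval

# Hypothetical local re-run via the vLLM backend; task names are illustrative.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT,dtype=bfloat16",
    tasks=["bbh", "gpqa", "mmlu_pro"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```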
## SFT dataset mixture
|Dataset|Weight (examples)|Weight (tokens)|
|:------|:----:|:----:|
|DebateLabKIT/deepa2-conversations|25%|49%|
|DebateLabKIT/deep-argmap-conversations|25%|18%|
|allenai/tulu-3-sft-mixture|50%|33%|
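For illustration, an example-level mixture with these weights could be built with `datasets.interleave_datasets`. This is a minimal sketch, not the actual data pipeline; the `train` splits and the sampling seed are assumptions.
```python
from datasets import interleave_datasets, load_dataset

# Load the three source datasets (splits assumed to be "train").
sources = [
    load_dataset("DebateLabKIT/deepa2-conversations", split="train"),
    load_dataset("DebateLabKIT/deep-argmap-conversations", split="train"),
    load_dataset("allenai/tulu-3-sft-mixture", split="train"),
]

# Example-level weights from the table above: 25% / 25% / 50%.
mixture = interleave_datasets(
    sources,
    probabilities=[0.25, 0.25, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```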
## Training procedure
Trained with SFT on **1M examples** for 1 epoch with
* context length 8196
* packing (TRL implementation)
* *Spectrum* selective fine-tuning (top 30 percent of layers unfrozen)
```yaml
# Training parameters
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
learning_rate: 5.0e-6 # following _Tülu 3_ recipe
lr_scheduler_type: cosine
warmup_ratio: 0.1
```
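A minimal TRL sketch consistent with the parameters above, assuming TRL 0.12 as listed under framework versions. This is not the actual training script: the Spectrum YAML filename is hypothetical, `mixture` refers to the dataset sketch above, and preprocessing is omitted.
```python
import re
import torch
import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Spectrum-style selective fine-tuning: freeze everything except the
# parameters listed in the YAML produced by the spectrum tool
# (filename here is hypothetical).
with open("snr_results_top30percent.yaml") as f:
    unfrozen = yaml.safe_load(f)["unfrozen_parameters"]
for name, param in model.named_parameters():
    param.requires_grad = any(re.match(p, name) for p in unfrozen)

training_args = SFTConfig(
    output_dir="Llama-3.1-Argunaut-1-8B-SFT",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=8196,  # context length from the list above
    packing=True,         # TRL's example packing
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=mixture,  # e.g. the interleaved mixture sketched above
    processing_class=tokenizer,
)
trainer.train()
```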
Hardware: 2 x H100 GPUs.
_This work was performed on the HoreKa supercomputer funded by the
Ministry of Science, Research and the Arts Baden-Württemberg and by
the Federal Ministry of Education and Research._
### Framework versions
- TRL: 0.12.1
- Transformers: 4.46.3
- Pytorch: 2.4.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
## Credits
This work wouldn't be possible without all the **great contributions from the open LLM community**. Thank you! Special kudos go to
- @philschmid for his latest [fine-tuning boilerplate](https://www.philschmid.de/fine-tune-llms-in-2025)
- @lvwerra, @lewtun et al for building and maintaining [trl](https://github.com/huggingface/trl)
- @cognitivecomputations for sharing [spectrum](https://github.com/cognitivecomputations/spectrum/tree/main)
## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/DebateLabKIT__Llama-3.1-Argunaut-1-8B-SFT-details) and summarized results [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc).
| Metric |Value (%)|
|-------------------|--------:|
|**Average** | 23.56|
|IFEval (0-Shot) | 55.19|
|BBH (3-Shot) | 27.19|
|MATH Lvl 5 (4-Shot)| 11.18|
|GPQA (0-shot) | 4.47|
|MuSR (0-shot) | 15.85|
|MMLU-PRO (5-shot) | 27.47|