---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- sft
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
datasets:
- Lyte/Reasoning-Paused
pipeline_tag: text-generation
model-index:
- name: Llama-3.2-3B-Overthinker
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 64.08
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 20.1
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 2.64
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 1.23
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 3.9
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 22.06
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
      name: Open LLM Leaderboard
---
# Model Overview:
- **Training Data**: This model was trained on a dataset with columns for initial reasoning, step-by-step thinking, verifications after each step, and final answers based on full context. Is it better than the original base model? Hard to say without proper evaluations, and I don’t have the resources to run them manually.
- **Context Handling**: The model benefits from larger contexts (at least 4k tokens and up to 16k, though it was trained with a 32k-token context). It tends to "overthink," so giving it a longer context window helps it perform better.
- **Performance**: Based on my limited manual tests, the model seems to do best in conversational settings, especially mental-health support, creative tasks, and explanations. However, I encourage you to try it out yourself using this [Colab Notebook](https://colab.research.google.com/drive/1dcBbHAwYJuQJKqdPU570Hddv_F9wzjPO?usp=sharing).
- **Dataset Note**: The publicly available dataset is only a partial version. The full dataset was originally designed for a custom Mixture of Experts (MoE) architecture, but I couldn't afford to run the full experiment.
- **Acknowledgment**: Special thanks to KingNish for reigniting my passion to revisit this project. I almost abandoned it after my first attempt a month ago. Enjoy this experimental model!
# Inference Code:
- Feel free to make the initial reasoning, steps, and verifications collapsible, showing only the final answer for an o1-style feel (just an idea, untested).
- **Note:** One feature here is the ability to control how many thinking steps and verifications are generated, via the `num_steps` argument in the code below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Lyte/Llama-3.2-3B-Overthinker"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
def generate_response(prompt, max_tokens=16384, temperature=0.8, top_p=0.95, repeat_penalty=1.1, num_steps=3):
    messages = [{"role": "user", "content": prompt}]

    # Generate the initial reasoning
    reasoning_template = tokenizer.apply_chat_template(messages, tokenize=False, add_reasoning_prompt=True)
    reasoning_inputs = tokenizer(reasoning_template, return_tensors="pt").to(model.device)
    reasoning_ids = model.generate(
        **reasoning_inputs,
        max_new_tokens=max_tokens // 3,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repeat_penalty
    )
    reasoning_output = tokenizer.decode(reasoning_ids[0, reasoning_inputs.input_ids.shape[1]:], skip_special_tokens=True)

    # Generate thinking (step-by-step and verifications)
    messages.append({"role": "reasoning", "content": reasoning_output})
    thinking_template = tokenizer.apply_chat_template(messages, tokenize=False, add_thinking_prompt=True, num_steps=num_steps)
    thinking_inputs = tokenizer(thinking_template, return_tensors="pt").to(model.device)
    thinking_ids = model.generate(
        **thinking_inputs,
        max_new_tokens=max_tokens // 3,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repeat_penalty
    )
    thinking_output = tokenizer.decode(thinking_ids[0, thinking_inputs.input_ids.shape[1]:], skip_special_tokens=True)

    # Generate the final answer from the full context
    messages.append({"role": "thinking", "content": thinking_output})
    answer_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    answer_inputs = tokenizer(answer_template, return_tensors="pt").to(model.device)
    answer_ids = model.generate(
        **answer_inputs,
        max_new_tokens=max_tokens // 3,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repeat_penalty
    )
    answer_output = tokenizer.decode(answer_ids[0, answer_inputs.input_ids.shape[1]:], skip_special_tokens=True)

    return reasoning_output, thinking_output, answer_output
# Example usage: the function returns a (reasoning, thinking, answer) tuple.
prompt = "Explain the process of photosynthesis."
reasoning, thinking, answer = generate_response(prompt, num_steps=5)
print("Reasoning:\n", reasoning)
print("Thinking:\n", thinking)
print("Answer:\n", answer)
```
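If GPU memory is tight, the model can also be loaded in 4-bit before running the inference code above. This is a minimal sketch using the standard `BitsAndBytesConfig` API; it assumes `bitsandbytes` is installed, and the quantization settings shown are illustrative defaults rather than values taken from this repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Lyte/Llama-3.2-3B-Overthinker"

# Illustrative 4-bit settings; adjust to taste.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# `generate_response` from the snippet above works unchanged with this model object.
```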
# Uploaded model
- **Developed by:** Lyte
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-3b-instruct-bnb-4bit
This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
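For reference, a typical Unsloth + TRL SFT setup looks roughly like the sketch below. This is **not** the exact training script for this model: the LoRA settings, dataset text field, and trainer arguments are illustrative placeholders (and the `SFTTrainer` keyword set varies by TRL version); only the base model name, the dataset name, and the 32k sequence length come from this card.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit base model (model name from this card; other values are placeholders).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
    max_seq_length=32768,
    load_in_4bit=True,
)

# Attach LoRA adapters (rank/alpha/targets are illustrative, not the actual values used).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("Lyte/Reasoning-Paused", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # placeholder: depends on how the dataset is formatted
    max_seq_length=32768,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```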
# Notice:
- **The problem with running evals is that they do not use the model's custom multi-stage template, so they are not a true evaluation of it; the scores below barely test the model.**
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Lyte__Llama-3.2-3B-Overthinker)
| Metric |Value|
|-------------------|----:|
|Avg. |19.00|
|IFEval (0-Shot) |64.08|
|BBH (3-Shot) |20.10|
|MATH Lvl 5 (4-Shot)| 2.64|
|GPQA (0-shot) | 1.23|
|MuSR (0-shot) | 3.90|
|MMLU-PRO (5-shot) |22.06|