prince-canuma
committed on
Update README.md
README.md
CHANGED
@@ -7,17 +7,13 @@ datasets:
|
|
7 |
- prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
|
8 |
---
|
9 |
|
10 |
-
# Model
|
11 |
-
|
12 |
-
<!-- Provide a quick summary of what the model is/does. -->
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
## Model Details
|
17 |
-
|
18 |
-
### Model Description
|
19 |
<img src="llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/>
|
20 |
|
21 |
<!-- Provide a longer summary of what this model is. -->
|
22 |
|
23 |
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
|
@@ -30,61 +26,72 @@ This is the model card of a 🤗 transformers model that has been pushed on the
|
|
30 |
- **License:** [More Information Needed]
|
31 |
- **Finetuned from model [optional]:** [More Information Needed]
|
32 |
|
33 |
### Model Sources [optional]
|
34 |
|
35 |
<!-- Provide the basic links for the model. -->
|
36 |
|
37 |
-
- **Repository:**
|
38 |
-
- **
|
39 |
-
- **Demo [optional]:** [More Information Needed]
|
40 |
|
41 |
## Uses
|
42 |
|
43 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
|
|
44 |
|
45 |
-
###
|
46 |
-
|
47 |
-
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
48 |
-
|
49 |
-
[More Information Needed]
|
50 |
-
|
51 |
-
### Downstream Use [optional]
|
52 |
-
|
53 |
-
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
|
54 |
-
|
55 |
-
[More Information Needed]
|
56 |
|
57 |
-
|
58 |
-
|
59 |
-
|
60 |
-
|
61 |
-
[More Information Needed]
|
62 |
|
63 |
-
##
|
64 |
|
65 |
-
|
66 |
|
67 |
-
|
|
|
68 |
|
69 |
-
|
70 |
|
71 |
-
|
72 |
|
73 |
-
|
74 |
|
75 |
-
|
76 |
|
77 |
-
|
78 |
|
79 |
-
[More Information Needed]
|
80 |
|
81 |
## Training Details
|
82 |
|
83 |
### Training Data
|
84 |
|
85 |
-
|
86 |
-
|
87 |
-
[More Information Needed]
|
88 |
|
89 |
### Training Procedure
|
90 |
|
@@ -95,15 +102,45 @@ Use the code below to get started with the model.
|
|
95 |
[More Information Needed]
|
96 |
|
97 |
|
98 |
-
#### Training
|
99 |
|
100 |
-
|
101 |
|
102 |
-
|
103 |
|
104 |
-
|
105 |
|
106 |
-
[More Information Needed]
|
107 |
|
108 |
## Evaluation
|
109 |
|
@@ -136,42 +173,12 @@ Use the code below to get started with the model.
|
|
136 |
#### Summary
|
137 |
|
138 |
|
139 |
-
|
140 |
## Model Examination [optional]
|
141 |
|
142 |
<!-- Relevant interpretability work for the model goes here -->
|
143 |
|
144 |
[More Information Needed]
|
145 |
|
146 |
-
## Environmental Impact
|
147 |
-
|
148 |
-
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
|
149 |
-
|
150 |
-
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
151 |
-
|
152 |
-
- **Hardware Type:** [More Information Needed]
|
153 |
-
- **Hours used:** [More Information Needed]
|
154 |
-
- **Cloud Provider:** [More Information Needed]
|
155 |
-
- **Compute Region:** [More Information Needed]
|
156 |
-
- **Carbon Emitted:** [More Information Needed]
|
157 |
-
|
158 |
-
## Technical Specifications [optional]
|
159 |
-
|
160 |
-
### Model Architecture and Objective
|
161 |
-
|
162 |
-
[More Information Needed]
|
163 |
-
|
164 |
-
### Compute Infrastructure
|
165 |
-
|
166 |
-
[More Information Needed]
|
167 |
-
|
168 |
-
#### Hardware
|
169 |
-
|
170 |
-
[More Information Needed]
|
171 |
-
|
172 |
-
#### Software
|
173 |
-
|
174 |
-
[More Information Needed]
|
175 |
|
176 |
## Citation [optional]
|
177 |
|
@@ -185,4 +192,84 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
|
|
185 |
author={Prince Canuma},
|
186 |
year={2024},
|
187 |
}
|
188 |
-
```
|
|
|
7 |
- prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
|
8 |
---
|
9 |
|
10 |
+
# Model Summary
|
11 |
<img src="llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/>
|
12 |
|
13 |
+
This is the world's first Llama-3 base model with 6B parameters. It is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was downcycled from Meta-Llama-3-8B.
|
14 |
+
It was continually pretrained on 1B tokens of English-only text from FineWeb and achieves the following results on the evaluation set:
|
15 |
+
- Loss: 2.4942
|
16 |
+
|
17 |
<!-- Provide a longer summary of what this model is. -->
|
18 |
|
19 |
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
|
|
|
26 |
- **License:** [More Information Needed]
|
27 |
- **Finetuned from model [optional]:** [More Information Needed]
|
28 |
|
29 |
+
## Model Description
|
30 |
+
|
31 |
+
<!-- Provide a longer summary of what this model is. -->
|
32 |
+
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
|
33 |
+
|
34 |
+
- **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma)
|
35 |
+
- **Model type:** Transformer
|
36 |
+
- **License:** MIT
|
37 |
+
- **Finetuned from model:** prince-canuma/Llama-3-6B-v0
|
38 |
+
|
39 |
### Model Sources [optional]
|
40 |
|
41 |
<!-- Provide the basic links for the model. -->
|
42 |
|
43 |
+
- **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3
|
44 |
+
- **Video [optional]:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr
|
|
|
45 |
|
46 |
## Uses
|
47 |
|
48 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
49 |
+
You can use this model as a base for instruct and chat versions targeting use cases such as coding assistants, RAG, function calling, and more.
|
50 |
|
51 |
+
### Limitations
|
52 |
|
53 |
+
This model inherits some of the base model's limitations and adds some from its creation process, such as:
|
54 |
+
- Limited scope for coding and math: According to benchmarks, this model needs more pretraining/finetuning on code and math data to excel at reasoning tasks.
|
55 |
+
- Language limitations: This model was continually pretrained on English-only data. If you plan to use it for multilingual use cases, I recommend fine-tuning or continued pretraining.
|
56 |
|
57 |
+
## How to Get Started with the Model
|
58 |
|
59 |
+
Use the code below to get started with the model.
|
60 |
|
61 |
+
```python
|
62 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
63 |
|
64 |
+
# Load model and tokenizer
|
65 |
+
model_name = "prince-canuma/Llama-3-6B-v0.1"
|
66 |
+
model = AutoModelForCausalLM.from_pretrained(model_name)
|
67 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
68 |
|
69 |
+
inputs = tokenizer(
|
70 |
+
[
|
71 |
+
"Who created Python?"
|
72 |
+
], return_tensors = "pt")
|
73 |
|
74 |
+
from transformers import TextStreamer
|
75 |
+
text_streamer = TextStreamer(tokenizer)
|
76 |
+
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 200)
|
77 |
|
78 |
+
```
|
79 |
|
80 |
+
Output:
|
81 |
+
```shell
|
82 |
+
<|begin_of_text|>Who created Python? What is Python used for? What is the difference between Python 2 and Python 3? What is the difference between Python and Python 3?
|
83 |
+
Python is a programming language that was created by Guido van Rossum in 1991. It is a widely used language for web development, data science, and machine learning. Python is also used for creating software applications and games.
|
84 |
+
Python is a powerful language that is easy to learn and use. It has a large library of built-in functions and packages that make it easy to write code. Python is also a very popular language for web development, with many popular web frameworks such as Django and Flask being written in Python.
|
85 |
+
Python is also used for data science and machine learning. It has a large library of packages for data analysis, machine learning, and artificial intelligence. Python is also used for creating software applications and games.
|
86 |
+
Python 2 and Python 3 are two different versions of the Python language. Python 2 was the original version of the
|
87 |
+
```
|
88 |
|
|
|
89 |
|
90 |
## Training Details
|
91 |
|
92 |
### Training Data
|
93 |
|
94 |
+
For continued pretraining, I extracted 1B tokens from the [Hugging Face FineWeb CC-MAIN-2024-10](https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl) slice.
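
The card does not say how the 1B-token subset was drawn from the slice. As a minimal sketch of one way to cut a corpus to a token budget, the hypothetical helper below (`take_token_budget` is not part of any library) keeps documents in order until the budget would be exceeded. The whitespace token counter is only a stand-in for illustration.

```python
from typing import Callable, Iterable, Iterator


def take_token_budget(
    docs: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Iterator[str]:
    """Yield documents in order until adding one more would exceed `budget` tokens."""
    used = 0
    for text in docs:
        n = count_tokens(text)
        if used + n > budget:
            break
        used += n
        yield text


# Toy data and a whitespace counter, both assumptions for demonstration only.
sample_docs = ["the quick brown fox", "jumps over", "the lazy dog"]


def whitespace_count(text: str) -> int:
    return len(text.split())


subset = list(take_token_budget(sample_docs, whitespace_count, budget=6))
print(subset)
```

For a real run, `count_tokens` would more plausibly wrap the model tokenizer, for example `lambda t: len(tokenizer(t)["input_ids"])`, and `budget` would be 1,000,000,000.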
|
95 |
|
96 |
### Training Procedure
|
97 |
|
|
|
102 |
[More Information Needed]
|
103 |
|
104 |
|
105 |
+
#### Training hyperparameters
|
106 |
|
107 |
+
The following hyperparameters were used during training:
|
108 |
+
- learning_rate: 0.0002
|
109 |
+
- train_batch_size: 2
|
110 |
+
- eval_batch_size: 2
|
111 |
+
- seed: 42
|
112 |
+
- distributed_type: multi-GPU
|
113 |
+
- num_devices: 4
|
114 |
+
- gradient_accumulation_steps: 8
|
115 |
+
- total_train_batch_size: 64
|
116 |
+
- total_eval_batch_size: 8
|
117 |
+
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
|
118 |
+
- lr_scheduler_type: cosine
|
119 |
+
- lr_scheduler_warmup_steps: 100
|
120 |
+
- num_epochs: 2
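
The totals in the list above follow from the per-device values. The short check below reproduces them and, assuming each logged step is one optimizer update, estimates how many sequences one epoch covers from the step count reported at epoch 1.0 in the training results.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 2              # per-device micro batch
eval_batch_size = 2               # per-device eval batch
num_devices = 4
gradient_accumulation_steps = 8

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# Assumption: one logged step equals one optimizer update.
steps_per_epoch = 23468           # step count at epoch 1.0 in the results table
sequences_per_epoch = steps_per_epoch * total_train_batch_size

print(total_train_batch_size, total_eval_batch_size, sequences_per_epoch)
```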
|
121 |
|
122 |
+
### Training results
|
123 |
|
124 |
+
| Training Loss | Epoch | Step | Validation Loss |
|
125 |
+
|:-------------:|:-----:|:-----:|:---------------:|
|
126 |
+
| 7.1562 | 0.0 | 1 | 7.1806 |
|
127 |
+
| 2.7339 | 0.25 | 5867 | 2.6266 |
|
128 |
+
| 2.6905 | 0.5 | 11734 | 2.5872 |
|
129 |
+
| 2.6134 | 0.75 | 17601 | 2.5549 |
|
130 |
+
| 2.532 | 1.0 | 23468 | 2.5235 |
|
131 |
+
| 2.5319 | 1.25 | 29335 | 2.5067 |
|
132 |
+
| 2.3336 | 1.5 | 35202 | 2.4968 |
|
133 |
+
| 2.3486 | 1.75 | 41069 | 2.4942 |
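
To make the validation losses easier to interpret, they can be converted to perplexity. This sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in the card.

```python
import math

# Validation losses copied from the table above, keyed by epoch.
validation_loss = {
    0.0: 7.1806,
    0.25: 2.6266,
    0.5: 2.5872,
    0.75: 2.5549,
    1.0: 2.5235,
    1.25: 2.5067,
    1.5: 2.4968,
    1.75: 2.4942,
}


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss)


for epoch, loss in validation_loss.items():
    print(f"epoch {epoch:>4}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

Under that assumption, the final validation loss of 2.4942 corresponds to a perplexity of roughly 12.1.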
|
134 |
+
|
135 |
+
|
136 |
+
### Framework versions
|
137 |
+
|
138 |
+
- PEFT 0.10.0
|
139 |
+
- Transformers 4.40.0.dev0
|
140 |
+
- PyTorch 2.2.0+cu121
|
141 |
+
- Datasets 2.15.0
|
142 |
+
- Tokenizers 0.15.0
|
143 |
|
|
|
144 |
|
145 |
## Evaluation
|
146 |
|
|
|
173 |
#### Summary
|
174 |
|
175 |
|
|
|
176 |
## Model Examination [optional]
|
177 |
|
178 |
<!-- Relevant interpretability work for the model goes here -->
|
179 |
|
180 |
[More Information Needed]
|
181 |
|
182 |
|
183 |
## Citation [optional]
|
184 |
|
|
|
192 |
author={Prince Canuma},
|
193 |
year={2024},
|
194 |
}
|
195 |
+
```
|
196 |
+
|
197 |
+
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
|
198 |
+
<details><summary>See axolotl config</summary>
|
199 |
+
|
200 |
+
axolotl version: `0.4.0`
|
201 |
+
```yaml
|
202 |
+
base_model: prince-canuma/Llama-3-6B-v0.1
|
203 |
+
model_type: AutoModelForCausalLM
|
204 |
+
tokenizer_type: AutoTokenizer
|
205 |
+
|
206 |
+
load_in_8bit: false
|
207 |
+
load_in_4bit: true
|
208 |
+
strict: false
|
209 |
+
|
210 |
+
datasets:
|
211 |
+
  - path: prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
|
212 |
+
    type: completion
|
213 |
+
    split: train
|
214 |
+
dataset_prepared_path: last_run_prepared
|
215 |
+
val_set_size: 0.001
|
216 |
+
output_dir: ./llama-3-6b
|
217 |
+
save_safetensors: true
|
218 |
+
adapter: qlora
|
219 |
+
lora_model_dir:
|
220 |
+
|
221 |
+
sequence_len: 8192
|
222 |
+
sample_packing: false
|
223 |
+
pad_to_sequence_len: false
|
224 |
+
|
225 |
+
lora_r: 128
|
226 |
+
lora_alpha: 128
|
227 |
+
lora_dropout: 0.05
|
228 |
+
lora_target_modules:
|
229 |
+
lora_target_linear: true
|
230 |
+
lora_fan_in_fan_out:
|
231 |
+
|
232 |
+
|
233 |
+
wandb_project: llama-3-6b
|
234 |
+
wandb_entity:
|
235 |
+
wandb_watch:
|
236 |
+
wandb_name:
|
237 |
+
wandb_log_model:
|
238 |
+
|
239 |
+
gradient_accumulation_steps: 8
|
240 |
+
micro_batch_size: 2
|
241 |
+
num_epochs: 2
|
242 |
+
optimizer: paged_adamw_32bit
|
243 |
+
lr_scheduler: cosine
|
244 |
+
learning_rate: 2e-4
|
245 |
+
|
246 |
+
train_on_inputs: false
|
247 |
+
group_by_length: false
|
248 |
+
bf16: auto
|
249 |
+
fp16:
|
250 |
+
tf32: false
|
251 |
+
|
252 |
+
gradient_checkpointing: true
|
253 |
+
early_stopping_patience:
|
254 |
+
resume_from_checkpoint:
|
255 |
+
local_rank:
|
256 |
+
logging_steps: 1
|
257 |
+
xformers_attention:
|
258 |
+
flash_attention: true
|
259 |
+
|
260 |
+
warmup_steps: 100
|
261 |
+
evals_per_epoch: 4
|
262 |
+
eval_table_size:
|
263 |
+
save_steps: 4000
|
264 |
+
debug:
|
265 |
+
deepspeed:
|
266 |
+
weight_decay: 0.0
|
267 |
+
fsdp:
|
268 |
+
fsdp_config:
|
269 |
+
special_tokens:
|
270 |
+
  pad_token: "<|reserved_special_token_0|>"
|
271 |
+
|
272 |
+
|
273 |
+
```
|
274 |
+
|
275 |
+
</details><br>
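
The config combines `learning_rate: 2e-4`, `warmup_steps: 100`, and `lr_scheduler: cosine`. The sketch below shows the common shape of such a schedule, linear warmup followed by cosine decay to zero. The total step count is an assumption estimated from the results table (5867 steps per quarter epoch, times 8 for two epochs), and the exact schedule used in the run may differ in details.

```python
import math

BASE_LR = 2e-4        # learning_rate in the config
WARMUP_STEPS = 100    # warmup_steps in the config
TOTAL_STEPS = 46936   # assumption: 8 x 5867, estimated from the results table


def lr_at(step: int, base_lr: float = BASE_LR,
          warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 50, 100, 23468, 46936):
    print(step, f"{lr_at(step):.2e}")
```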
|