Remek committed on
Commit 9b92643 · verified · 1 Parent(s): 73be4c0

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,222 @@
+ ---
+ language:
+ - pl
+ license: cc-by-nc-4.0
+ library_name: transformers
+ tags:
+ - finetuned
+ - autoquant
+ - awq
+ inference:
+   parameters:
+     temperature: 0.6
+ widget:
+ - messages:
+   - role: user
+     content: Co przedstawia polskie godło?
+ ---
+
+ <p align="center">
+ <img src="https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1/raw/main/speakleash_cyfronet.png">
+ </p>
+
+ # Bielik-7B-Instruct-v0.1
+
+ Bielik-7B-Instruct-v0.1 is an instruction fine-tuned version of [Bielik-7B-v0.1](https://huggingface.co/speakleash/Bielik-7B-v0.1). The model is the result of a collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. It was developed and trained on Polish text corpora selected and processed by the SpeakLeash team, using Polish large-scale computing infrastructure within the PLGrid environment, specifically the HPC center ACK Cyfronet AGH. The creation and training of Bielik-7B-Instruct-v0.1 was supported by computational grant number PLG/2024/016951 and conducted on the Helios supercomputer, which provided the computational resources needed for large-scale machine learning. As a result, the model understands and processes the Polish language well, providing accurate responses and performing a variety of linguistic tasks with high precision.
+
+ ## Model
+
+ The [SpeakLeash](https://speakleash.org/) team is building its own set of instructions in Polish, which annotators continuously expand and refine. A manually verified and corrected portion of these instructions was used for training. In addition, because high-quality instructions in Polish are scarce, publicly accessible collections of instructions in English were used: [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) and [orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k), which accounted for half of the instructions used in training. The instructions varied in quality, which degraded the model's performance. To counteract this while still making use of the aforementioned datasets, several improvements were introduced:
+ * Weighted token-level loss - a strategy inspired by [offline reinforcement learning](https://arxiv.org/abs/2005.01643) and [C-RLFT](https://arxiv.org/abs/2309.11235)
+ * Adaptive learning rate inspired by the study on [Learning Rates as a Function of Batch Size](https://arxiv.org/abs/2006.09092)
+ * Masked user instructions
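The first and third items above can be illustrated together. The following is only a minimal sketch, not the ALLaMo implementation: it assumes a single hypothetical per-sample quality weight (`sample_weight`), a boolean mask that excludes user-instruction tokens from the loss, and normalization by the number of unmasked tokens. None of these details are confirmed by the model card.

```python
import math


def weighted_masked_nll(token_log_probs, is_assistant_token, sample_weight):
    """Weighted, masked negative log-likelihood for one training sample.

    token_log_probs: log-probability the model assigned to each target token.
    is_assistant_token: True where the token belongs to the assistant reply;
        user-instruction tokens (False) contribute nothing to the loss.
    sample_weight: hypothetical quality weight for the sample, e.g. lower for
        noisier instruction sources (an assumption made for illustration).
    """
    total, count = 0.0, 0
    for log_p, keep in zip(token_log_probs, is_assistant_token):
        if keep:
            total += -log_p * sample_weight
            count += 1
    # Average over the tokens that were actually trained on.
    return total / count if count else 0.0


# Two masked user tokens followed by two assistant tokens, each with p = 0.5.
loss = weighted_masked_nll(
    [math.log(0.5)] * 4, [False, False, True, True], sample_weight=0.5
)
print(round(loss, 4))  # 0.3466
```

With a weight of 1.0 and no masking this reduces to the usual mean cross-entropy, so the two mechanisms can be seen as scaling and filtering the standard objective.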
+
+ Bielik-7B-Instruct-v0.1 was trained with an original open-source framework called [ALLaMo](https://github.com/chrisociepa/allamo), implemented by [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/). The framework allows users to train language models with architectures similar to LLaMA and Mistral in a fast and efficient way.
+
+ ### Model description:
+
+ * **Developed by:** [SpeakLeash](https://speakleash.org/)
+ * **Language:** Polish
+ * **Model type:** causal decoder-only
+ * **Finetuned from:** [Bielik-7B-v0.1](https://huggingface.co/speakleash/Bielik-7B-v0.1)
+ * **License:** CC BY NC 4.0 (non-commercial use)
+ * **Model ref:** speakleash:e38140bea0d48f1218540800bbc67e89
+
+ ## Training
+
+ * Framework: [ALLaMo](https://github.com/chrisociepa/allamo)
+ * Visualizations: [W&B](https://wandb.ai)
+
+ <p align="center">
+ <img src="https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1/raw/main/sft_train_loss.png">
+ </p>
+ <p align="center">
+ <img src="https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1/raw/main/sft_train_ppl.png">
+ </p>
+ <p align="center">
+ <img src="https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1/raw/main/sft_train_lr.png">
+ </p>
+
+ ### Training hyperparameters:
+
+ | **Hyperparameter**               | **Value**        |
+ |----------------------------------|------------------|
+ | Micro Batch Size                 | 1                |
+ | Batch Size                       | up to 4194304    |
+ | Learning Rate (cosine, adaptive) | 7e-6 -> 6e-7     |
+ | Warmup Iterations                | 50               |
+ | All Iterations                   | 55440            |
+ | Optimizer                        | AdamW            |
+ | β1, β2                           | 0.9, 0.95        |
+ | Adam_eps                         | 1e-8             |
+ | Weight Decay                     | 0.05             |
+ | Grad Clip                        | 1.0              |
+ | Precision                        | bfloat16 (mixed) |
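To make the learning-rate row of the table concrete, here is a minimal sketch of a cosine schedule with warmup that uses the table's values. It assumes a linear warmup and a decay that spans all remaining iterations. It does not model the "adaptive" part (the dependence on batch size), whose exact form the model card does not specify.

```python
import math

# Values taken from the hyperparameter table.
MAX_LR = 7e-6
MIN_LR = 6e-7
WARMUP_ITERS = 50
TOTAL_ITERS = 55440


def learning_rate(step):
    """Learning rate at a given iteration (linear warmup is an assumption)."""
    if step < WARMUP_ITERS:
        # Ramp up linearly to MAX_LR over the warmup iterations.
        return MAX_LR * (step + 1) / WARMUP_ITERS
    # Fraction of the post-warmup training that has elapsed, clamped to 1.
    progress = min((step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS), 1.0)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for step in (0, 49, 27745, 55440):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at 7e-6 once warmup finishes, passes the midpoint between the two bounds halfway through the decay, and ends at 6e-7.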
+
+
+ ### Instruction format
+
+ To take advantage of instruction fine-tuning, surround your prompt with the `[INST]` and `[/INST]` tokens. The very first instruction should start with the beginning-of-sentence token (`<s>`). The generated completion ends with the end-of-sentence token (`</s>`).
+
+ E.g.
+ ```
+ prompt = "<s>[INST] Jakie mamy pory roku? [/INST]"
+ completion = "W Polsce mamy 4 pory roku: wiosna, lato, jesień i zima.</s>"
+ ```
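For readers who want to build such prompts by hand, the following sketch renders a conversation into the format above. It is a simplified re-implementation for illustration only: `build_prompt` is a hypothetical helper, it follows the user/assistant rules of the `chat_template` in `tokenizer_config.json`, and it deliberately leaves out that template's system-message handling.

```python
BOS, EOS = "<s>", "</s>"


def build_prompt(messages):
    """Render alternating user/assistant messages into the [INST] format."""
    parts = [BOS]
    for index, message in enumerate(messages):
        # Turns must alternate, starting with the user.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Conversation roles must alternate user/assistant/...")
        content = message["content"].strip()
        if message["role"] == "user":
            parts.append(f"[INST] {content} [/INST]")
        else:
            # Assistant turns are preceded by a space and closed with EOS.
            parts.append(f" {content}{EOS}")
    return "".join(parts)


print(build_prompt([{"role": "user", "content": "Jakie mamy pory roku?"}]))
# <s>[INST] Jakie mamy pory roku? [/INST]
```

If you tokenize the resulting string yourself, note that `tokenizer_config.json` sets `add_bos_token` to `true`, so passing `add_special_tokens=False` to the tokenizer is likely needed to avoid a duplicated `<s>`.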
+
+ This format is available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ device = "cuda" # the device to load the model onto
+
+ model_name = "speakleash/Bielik-7B-Instruct-v0.1"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
+
+ messages = [
+     {"role": "user", "content": "Jakie mamy pory roku w Polsce?"},
+     {"role": "assistant", "content": "W Polsce mamy 4 pory roku: wiosna, lato, jesień i zima."},
+     {"role": "user", "content": "Która jest najcieplejsza?"}
+ ]
+
+ # Render the conversation with the chat template and tokenize it
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
+
+ model_inputs = input_ids.to(device)
+ model.to(device)
+
+ # Sample up to 1000 new tokens, then decode the full sequence (prompt + completion)
+ generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
+ decoded = tokenizer.batch_decode(generated_ids)
+ print(decoded[0])
+ ```
+
+ ## Evaluation
+
+
+ Models were evaluated on the [Open PL LLM Leaderboard](https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard) in a 5-shot setting. The benchmark evaluates models on NLP tasks such as sentiment analysis, categorization, and text classification, but does not test chatting skills. The table below reports:
+ - Average - average score across all tasks, normalized by baseline scores
+ - Reranking - reranking task, commonly used in RAG
+ - Reader (Generator) - open-book question answering task, commonly used in RAG
+ - Perplexity (lower is better) - reported as a bonus; it does not correlate with the other scores and should not be used for model comparison
+
+ | | Average | RAG Reranking | RAG Reader | Perplexity |
+ |--------------------------------------------------------------------------------------|----------:|--------------:|-----------:|-----------:|
+ | **7B parameters models:** | | | | |
+ | Baseline (majority class) | 0.00 | 53.36 | - | - |
+ | Voicelab/trurl-2-7b | 18.85 | 60.67 | 77.19 | 1098.88 |
+ | meta-llama/Llama-2-7b-chat-hf | 21.04 | 54.65 | 72.93 | 4018.74 |
+ | mistralai/Mistral-7B-Instruct-v0.1 | 26.42 | 56.35 | 73.68 | 6909.94 |
+ | szymonrucinski/Curie-7B-v1 | 26.72 | 55.58 | 85.19 | 389.17 |
+ | HuggingFaceH4/zephyr-7b-beta | 33.15 | 71.65 | 71.27 | 3613.14 |
+ | HuggingFaceH4/zephyr-7b-alpha | 33.97 | 71.47 | 73.35 | 4464.45 |
+ | internlm/internlm2-chat-7b-sft | 36.97 | 73.22 | 69.96 | 4269.63 |
+ | internlm/internlm2-chat-7b | 37.64 | 72.29 | 71.17 | 3892.50 |
+ | [Bielik-7B-Instruct-v0.1](https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1) | 39.28 | 61.89 | **86.00** | 277.92 |
+ | mistralai/Mistral-7B-Instruct-v0.2 | 40.29 | 72.58 | 79.39 | 2088.08 |
+ | teknium/OpenHermes-2.5-Mistral-7B | 42.64 | 70.63 | 80.25 | 1463.00 |
+ | openchat/openchat-3.5-1210 | 44.17 | 71.76 | 82.15 | 1923.83 |
+ | speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e1_70c70cc6 | 45.44 | 71.27 | 91.50 | 279.24 |
+ | Nexusflow/Starling-LM-7B-beta | 45.69 | 74.58 | 81.22 | 1161.54 |
+ | openchat/openchat-3.5-0106 | 47.32 | 74.71 | 83.60 | 1106.56 |
+ | berkeley-nest/Starling-LM-7B-alpha | **47.46** | **75.73** | 82.86 | 1438.04 |
+ | | | | | |
+ | **Models with different sizes:** | | | | |
+ | Azurro/APT3-1B-Instruct-v1 (1B) | -13.80 | 52.11 | 12.23 | 739.09 |
+ | Voicelab/trurl-2-13b-academic (13B) | 29.45 | 68.19 | 79.88 | 733.91 |
+ | upstage/SOLAR-10.7B-Instruct-v1.0 (10.7B) | 46.07 | 76.93 | 82.86 | 789.58 |
+ | | | | | |
+ | **7B parameters pretrained and continuously pretrained models:** | | | | |
+ | OPI-PG/Qra-7b | 11.13 | 54.40 | 75.25 | 203.36 |
+ | meta-llama/Llama-2-7b-hf | 12.73 | 54.02 | 77.92 | 850.45 |
+ | internlm/internlm2-base-7b | 20.68 | 52.39 | 69.85 | 3110.92 |
+ | [Bielik-7B-v0.1](https://huggingface.co/speakleash/Bielik-7B-v0.1) | 29.38 | 62.13 | **88.39** | 123.31 |
+ | mistralai/Mistral-7B-v0.1 | 30.67 | 60.35 | 85.39 | 857.32 |
+ | internlm/internlm2-7b | 33.03 | 69.39 | 73.63 | 5498.23 |
+ | alpindale/Mistral-7B-v0.2-hf | 33.05 | 60.23 | 85.21 | 932.60 |
+ | speakleash/mistral-apt3-7B/spi-e0_hf | 35.50 | 62.14 | **87.48** | 132.78 |
+
+ SpeakLeash models achieve some of the best scores in the RAG Reader task.
+ Compared to Mistral-7B-v0.1, we increased the Average score by almost 9 pp. (39.28 vs. 30.67).
+ In our subjective evaluations of chatting skills, SpeakLeash models perform better than other models with higher Average scores.
+
+ ## Limitations and Biases
+
+ Bielik-7B-Instruct-v0.1 is a quick demonstration that the base model can be easily fine-tuned to achieve compelling and promising performance. It does not have any moderation mechanisms. We look forward to engaging with the community on ways to make the model respect guardrails, allowing for deployment in environments requiring moderated outputs.
+
+ Bielik-7B-Instruct-v0.1 can produce factually incorrect output and should not be relied on to produce factually accurate data. Bielik-7B-Instruct-v0.1 was trained on various public datasets. While great efforts have been made to clean the training data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs.
+
+ ## License
+
+ Because of an unclear legal situation, we have decided to publish the model under the CC BY NC 4.0 license, which allows for non-commercial use. The model can be used for scientific purposes and privately, as long as the license conditions are met.
+
+ ## Citation
+ Please cite this model using the following format:
+
+ ```
+ @misc{Bielik7Bv01,
+   title = {Introducing Bielik-7B-Instruct-v0.1: Instruct Polish Language Model},
+   author = {Ociepa, Krzysztof and Flis, Łukasz and Wróbel, Krzysztof and Kondracki, Sebastian and {SpeakLeash Team} and {Cyfronet Team}},
+   year = {2024},
+   url = {https://huggingface.co/speakleash/Bielik-7B-Instruct-v0.1},
+   note = {Accessed: 2024-04-01}, % change this date
+   urldate = {2024-04-01} % change this date
+ }
+ ```
+
+ ## Responsible for training the model
+
+ * [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/)<sup>SpeakLeash</sup> - team leadership, conceptualizing, data preparation, process optimization and oversight of training
+ * [Łukasz Flis](https://www.linkedin.com/in/lukasz-flis-0a39631/)<sup>Cyfronet AGH</sup> - coordinating and supervising the training
+ * [Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/)<sup>SpeakLeash</sup> - benchmarks
+ * [Sebastian Kondracki](https://www.linkedin.com/in/sebastian-kondracki/)<sup>SpeakLeash</sup> - coordination and preparation of instructions
+ * [Maria Filipkowska](https://www.linkedin.com/in/maria-filipkowska/)<sup>SpeakLeash</sup> - preparation of instructions
+ * [Paweł Kiszczak](https://www.linkedin.com/in/paveu-kiszczak/)<sup>SpeakLeash</sup> - preparation of instructions
+ * [Adrian Gwoździej](https://www.linkedin.com/in/adrgwo/)<sup>SpeakLeash</sup> - data quality and instructions cleaning
+ * [Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/)<sup>SpeakLeash</sup> - instructions cleaning
+ * [Jacek Chwiła](https://www.linkedin.com/in/jacek-chwila/)<sup>SpeakLeash</sup> - instructions cleaning
+
+ The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model through their commitment to the open-science SpeakLeash project:
+ [Grzegorz Urbanowicz](https://www.linkedin.com/in/grzegorz-urbanowicz-05823469/),
+ [Szymon Baczyński](https://www.linkedin.com/in/szymon-baczynski/),
+ [Paweł Cyrta](https://www.linkedin.com/in/cyrta),
+ [Jan Maria Kowalski](https://www.linkedin.com/in/janmariakowalski/),
+ [Karol Jezierski](https://www.linkedin.com/in/karol-jezierski/),
+ [Kamil Nonckiewicz](https://www.linkedin.com/in/kamil-nonckiewicz/),
+ [Izabela Babis](https://www.linkedin.com/in/izabela-babis-2274b8105/),
+ [Nina Babis](https://www.linkedin.com/in/nina-babis-00055a140/),
+ [Waldemar Boszko](https://www.linkedin.com/in/waldemarboszko),
+ [Remigiusz Kinas](https://www.linkedin.com/in/remigiusz-kinas/),
+ and many other wonderful researchers and enthusiasts of the AI world.
+
+ Members of the ACK Cyfronet AGH team:
+ [Szymon Mazurek](https://www.linkedin.com/in/sz-mazurek-ai/).
+
+ ## Contact Us
+
+ If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our [Discord SpeakLeash](https://discord.gg/3G9DVM39).
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "_name_or_path": "Bielik-7B-Instruct-v0.1",
+   "architectures": [
+     "MistralForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "max_position_embeddings": 4096,
+   "model_type": "mistral",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "quantization_config": {
+     "bits": 4,
+     "group_size": 128,
+     "modules_to_not_convert": null,
+     "quant_method": "awq",
+     "version": "gemm",
+     "zero_point": true
+   },
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "sliding_window": 4096,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float16",
+   "transformers_version": "4.38.2",
+   "use_cache": true,
+   "vocab_size": 32000
+ }
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "do_sample": true,
+   "eos_token_id": 2,
+   "transformers_version": "4.38.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:25094389acc33f46b93a397e4974bbc171565a711ad94c60c7797717974f7bb7
+ size 4150880232
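As a sanity check, the size in this LFS pointer can be related to the values in `config.json`. The sketch below is a back-of-the-envelope estimate, not a description of how the file was produced. It assumes the usual AWQ GEMM layout (packed 4-bit weights plus one float16 scale and one 4-bit zero point per group of 128 weights), that only the linear layers inside the decoder blocks are quantized, that embeddings, `lm_head` and RMSNorm weights stay in float16, and that there are no bias terms.

```python
# Values copied from config.json in this commit.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
HEADS = 32
KV_HEADS = 8
VOCAB = 32000
BITS = 4
GROUP_SIZE = 128
FP16_BYTES = 2

head_dim = HIDDEN // HEADS
kv_dim = KV_HEADS * head_dim

# Weights of the linear layers in one decoder block.
linear_per_layer = (
    2 * HIDDEN * HIDDEN          # q_proj, o_proj
    + 2 * HIDDEN * kv_dim        # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
)
quantized_params = LAYERS * linear_per_layer

groups = quantized_params // GROUP_SIZE
quantized_bytes = (
    quantized_params * BITS // 8  # packed 4-bit weights
    + groups * FP16_BYTES         # one float16 scale per group
    + groups * BITS // 8          # one 4-bit zero point per group
)

# Assumed to stay in float16: embeddings, lm_head, RMSNorm weights.
fp16_params = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN
estimated_bytes = quantized_bytes + fp16_params * FP16_BYTES

print(estimated_bytes)  # 4150796288
```

Under these assumptions the estimate falls about 84 kB short of the 4150880232 bytes recorded above, a gap that would be consistent with the metadata header of the safetensors file.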
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "chat_template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + eos_token }}{% endif %}{% endfor %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }