jingyaogong commited on
Commit
c8ff9a6
·
verified ·
1 Parent(s): c0a5307

Upload 2 files

Browse files
Files changed (2) hide show
  1. README.md +287 -281
  2. README_en.md +284 -325
README.md CHANGED
@@ -1,4 +1,9 @@
 
 
1
  ![logo](./images/logo.png)
 
 
 
2
  <div align="center">
3
 
4
  ![visitors](https://visitor-badge.laobi.icu/badge?page_id=jingyaogong/minimind)
@@ -27,6 +32,14 @@
27
 
28
  ---
29
 
 
 
 
 
 
 
 
 
30
  # 📌 Introduction
31
 
32
  大语言模型(LLM)领域,如 GPT、LLaMA、GLM 等,虽然它们效果惊艳,
@@ -39,15 +52,14 @@
39
  因此,本项目的目标是把上手LLM的门槛无限降低,
40
  直接从0开始训练一个极其轻量的语言模型。
41
 
42
- (截至2024.09.01)MiniMind包含5个型号模型,最小仅需26M(0.02B),即可具备Amazing的对话能力!
 
43
 
44
- | 模型 (大小) | 速度 (Tokens/s) | 推理占用 | 训练占用(`batch_size=8`) | release | 主观评分(/100) |
45
- |------------------------|---------------|--------|----------------------|--------------------|------------|
46
- | MiniMind-small-T (26M) | 91.9 | 0.5 GB | 3.6 GB | 2024.08.28 | 55' |
47
- | MiniMind-small (56M) | 85.2 | 0.7 GB | 4.5 GB | 2024.08.28 | 55' |
48
- | MiniMind (218M) | 57.6 | 2.1 GB | 10.4 GB | 2024.08.28 | 75' |
49
- | MiniMind-MoE (166M) | 64.9 | 1.6 GB | 7.4 GB | 2024.08.28 | 40' |
50
- | MiniMind-V1 (108M) | 78.3 | 1.0 GB | 6.4 GB | 2024.09.01 (new🎉) | 80' |
51
 
52
  > 该分析在一个带有Torch 2.1.2、CUDA 12.2和Flash Attention 2的RTX 3090 GPU上运行。
53
 
@@ -57,19 +69,39 @@
57
 
58
  - 公开MiniMind模型代码(包含Dense和MoE模型)、Pretrain、SFT指令微调、LoRA微调、DPO偏好优化的全过程代码、数据集和来源。
59
  - 兼容`transformers`、`accelerate`、`trl`、`peft`等流行框架。
60
- - 训练支持单机单卡、单机多卡训练。训练过程中支持在任意位置停止,及在任意位置继续训练。
61
  - 在Ceval数据集上进行模型测试的代码。
62
  - 实现Openai-Api基本的chat接口,便于集成到第三方ChatUI使用(FastGPT、Open-WebUI等)。
63
 
64
  希望此开源项目可以帮助LLM初学者快速入门!
65
 
66
- 👉**最近更新**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
67
 
68
  <details close>
69
- <summary> <b>2024-09-01 (new🎉)</b> </summary>
70
- - 更新MiniMind-V1 (108M)模型,采用minimind_tokenizer,预训练轮次3 + SFT轮次10,更充分训练,性能更强。
71
  <summary> <b>2024-08-27</b> </summary>
72
- - 项目首次开源
 
 
73
  </details>
74
 
75
  # 📌 Environment
@@ -82,18 +114,23 @@
82
  * CUDA == 12.2
83
  * [requirements.txt](./requirements.txt)
84
 
85
- # 📌 Start Inference
86
 
87
  <div align="center" style="font-size: 1.5em; font-weight: bold;">
88
  <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
89
  Hugging Face
90
 
91
- [MiniMind-Collection](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
 
 
 
 
 
92
  </div>
93
 
94
  ```bash
95
  # step 1
96
- git clone https://huggingface.co/jingyaogong/minimind
97
  ```
98
 
99
  ```bash
@@ -101,8 +138,30 @@ git clone https://huggingface.co/jingyaogong/minimind
101
  python 2-eval.py
102
  ```
103
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
104
  # 📌 Quick Start
105
 
 
 
 
 
106
  * 1、克隆项目代码
107
  ```text
108
  git clone https://github.com/jingyaogong/minimind.git
@@ -111,50 +170,48 @@ python 2-eval.py
111
 
112
  * 2.1 下载[数据集下载地址](#数据集下载地址)放到`./dataset`目录下
113
 
114
- * 2.2 `python data_process.py`处理数据集,例如pretrain数据提前进行token-encoder、sft数据集抽离qa到csv文件。
115
 
116
- * 2.3 在`./model/LMConfig.py` 中调整model的参数配置。
117
- * 2.4 `python 1-pretrain.py` 执行预训练。
118
- * 2.5 `python 3-full_sft.py` 执行指令微调。
119
- * 2.6 `python 4-lora_sft.py` 执行lora微调(非必须)。
120
- * 2.7 `python 5-dpo_train.py` 执行DPO人类偏好强化学习对齐(非必须)。
121
  * 3、测试模型推理效果
122
- * 从下面【训练完成的模型权重】下载权重到`./out/`目录下
 
123
  ```text
124
  out
125
  ├── multi_chat
126
- │   ├── full_sft_1024.pth
127
  │   ├── full_sft_512.pth
128
- │   ├── full_sft_640_moe.pth
129
- │   └── full_sft_640.pth
130
  ├── single_chat
131
- │   ├── full_sft_1024.pth
132
  │   ├── full_sft_512.pth
133
- │   ├── full_sft_640_moe.pth
134
- │   └── full_sft_640.pth
135
- ├── full_sft_1024.pth
136
- ├── full_sft_512.pth
137
- ├── full_sft_640_moe.pth
138
- ├── full_sft_640.pth
139
- ├── pretrain_1024.pth
140
- ├── pretrain_640_moe.pth
141
- ├── pretrain_640.pth
142
  ```
143
  * `python 0-eval_pretrain.py`测试预训练模型的接龙效果
144
  * `python 2-eval.py`测试模型的对话效果
145
  ![2-eval](./images/2-eval.png)
146
 
147
- 🍭 【Tip】预训练和全参微调pretrain和full_sft均支持DDP多卡加速
148
-
149
- * 单机N卡启动训练
150
 
151
- ```text
 
152
  torchrun --nproc_per_node N 1-pretrain.py
153
- ```
154
-
155
- ```text
156
  torchrun --nproc_per_node N 3-full_sft.py
157
  ```
 
 
 
 
 
 
158
 
159
  # 📌 Data sources
160
 
@@ -166,29 +223,35 @@ python 2-eval.py
166
  因为LLM体积非常小,为了避免模型头重脚轻(词嵌入embedding层参数占整个LLM比太高),所以词表长度需要选择比较小。
167
  强大的开源模型例如01万物、千问、chatglm、mistral、Llama3等,它们的tokenizer词表长度如下:
168
 
169
- | Tokenizer 模型 | 词表大小 | 来源 |
170
- |--------------------|---------|------------|
171
- | yi tokenizer | 64,000 | 01万物(中国) |
172
- | qwen2 tokenizer | 151,643 | 阿里云(中国) |
173
- | glm tokenizer | 151,329 | 智谱AI(中国) |
174
- | mistral tokenizer | 32,000 | Mistral AI(法国) |
175
- | llama3 tokenizer | 128,000 | Meta(美国) |
176
- | minimind tokenizer | 6400 | 自定义 |
 
177
 
178
- > 尽管Mistral中文词语占比很少,编解码效率弱于qwen2、glm等中文友好型分词器。
179
- 但MiniMind这里选择了mistral tokenizer作为分词器以保持整体参数轻量,避免头重脚轻,因为mistral的词表大小只有32,000。
180
- 且MiniMind在实际测试中几乎没有出现过生僻词汇解码失败的情况,效果良好。
181
 
182
- > 方便对比测试效果,额外训练了一个自定义Tokenizer模型的版本**MiniMind-small-T**,自定义词表压缩长度到6400,使得LLM总参数进一步降低到26M左右。
 
 
 
183
 
184
  ---
185
 
186
- -
 
 
 
187
 
188
- 📙【Pretrain数据】:[seq-monkey通用文本数据集](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
189
- 是由多种公开来源的数据(如网页、百科、博客、开源代码、书籍等)汇总清洗而成。
190
- 整理成统一的JSONL格式,并经过了严格的筛选和去重,确保数据的全面性、规模、可信性和高质量。
191
- 总量大约在10B token,适合中文大语言模型的预训练。
192
 
193
  ---
194
 
@@ -222,12 +285,15 @@ python 2-eval.py
222
 
223
  ### 数据集下载地址
224
 
225
- | MiniMind训练数据集 | 下载地址 |
226
- |------------------|---------------------------------------------------------------------------------------------------------------|
227
- | **【Pretrain数据】** | [seq-monkey通用文本数据集](http://share.mobvoi.com:5000/sharing/O91blwPkY) |
228
- | **【SFT数据】** | [匠数大模型SFT数据集](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
229
- | **【DPO数据】** | [活字数据集1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
230
- | **【DPO数据】** | [活字数据集2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |
 
 
 
231
 
232
  # 📌 Model
233
 
@@ -251,18 +317,15 @@ MiniMind的整体结构一致,只是在RoPE计算、推理函数和FFN层的
251
  ![](./images/LLM-structure.png)
252
  ![](./images/LLM-structure-moe.png)
253
 
254
- 模型配置见[./model/LMConfig.py](./model/LMConfig.py)。模型型号和参数见下表:
 
255
 
256
  | Model Name | params | len_vocab | n_layers | d_model | kv_heads | q_heads | share+route | TopK |
257
  |------------------|--------|-----------|----------|---------|----------|---------|-------------|------|
258
- | minimind-small-T | 26M | 6400 | 8 | 512 | 8 | 16 | - | - |
259
- | minimind-small | 56M | 32000 | 8 | 640 | 8 | 16 | - | - |
260
- | minimind | 218M | 32000 | 16 | 1024 | 8 | 16 | - | - |
261
- | minimind-MoE | 162M | 32000 | 8 | 640 | 8 | 16 | 2+4 | 2 |
262
- | minimind-V1 | 108M | 6400 | 16 | 768 | 8 | 16 | - | - |
263
 
264
- 此外作为参考,GPT3的层数和维度参数见下表:
265
- ![gpt3_config.png](./images/gpt3_config.png)
266
 
267
  # 📌 Experiment
268
 
@@ -273,13 +336,11 @@ CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
273
  环境:python 3.9 + Torch 2.1.2 + DDP多卡训练
274
  ```
275
 
276
- | Model Name | params | len_vocab | batch_size | pretrain_time | sft_single_time | sft_multi_time |
277
- |------------------|--------|-----------|------------|--------------------|-------------------|---------------------|
278
- | minimind-small-T | 26M | 6400 | 64 | ≈5 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
279
- | minimind-small | 56M | 32000 | 24 | ≈6 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
280
- | minimind | 218M | 32000 | 16 | ≈15 hour (1 epoch) | ≈5 hour (1 epoch) | ≈1 hour (1 epoch) |
281
- | minimind-MoE | 166M | 32000 | 16 | ≈13 hour (1 epoch) | ≈5 hour (1 epoch) | ≈1 hour (1 epoch) |
282
- | minimind-V1 | 108M | 6400 | 16 | ≈8 hour (1 epoch) | ≈3 hour (1 epoch) | ≈1 hour (1 epoch) |
283
 
284
  ---
285
 
@@ -287,7 +348,7 @@ CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
287
  - LLM首先要学习的并非直接与人交流,而是让肚子中充满知识的墨水,至于墨水理论上喝的越饱越好,产生大量的对世界的认知积累。
288
  - 预训练就是让Model先埋头苦学大量基本的知识,例如从维基百科、新闻、常识、书籍等。
289
  - 它无监督的从大量的文本数据中压缩知识到自己模型的权重,目的是:学会词语接龙。例如我们输入“秦始皇是”四个字,它在大量学习后能预测出下一句话大概率是“中国的第一位皇帝”。
290
- > pretrain的学习率设置为1e-4到1e-5的动态学习率,预训练epoch数设为2,预训练时间不到1天。
291
  ```bash
292
  torchrun --nproc_per_node 2 1-pretrain.py
293
  ```
@@ -297,7 +358,7 @@ CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
297
  ”后不再无脑接龙,而是意识��这是一段完整的对话结束。
298
  - 我们称这个过程为指令微调,就如同让学富五车的「牛顿」先生适应21世纪的聊天习惯,学习屏幕左侧是对方消息,右侧是本人消息这个规律。
299
  - 在训练时,MiniMind的指令和回答长度被截断在512,是为了节省显存空间。就像我们学习时,会先从短的文章开始,当学会阅读200字作文后,800字长文章就不需要再单独学习。
300
- > 在推理时通过调整RoPE线性差值,实现长度外推到1024或2048及以上很方便。学习率设置为1e-5到1e-6的动态学习率,微调epoch数为5
301
 
302
  ```bash
303
  # 3-full_sft.py中设置数据集为sft_data_single.csv
@@ -309,7 +370,7 @@ CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
309
  - 构建【问题->回答,问题->回答,问题->】的新聊天模板,然后使用这个数据集进行微调。
310
  - 学习完成的模型不仅仅只能回答当前问题,还能根据历史对话进行连贯的对话。
311
  - 这一步并非必须,因为小模型长上文对话能力很弱,强行对齐多轮问答模板会损失一定程度的单轮SFT效果。
312
- > 学习率设置为1e-5到1e-6的动态学习率,微调epoch数为2
313
  ```bash
314
  # 3-full_sft.py中设置数据集为sft_data.csv
315
  torchrun --nproc_per_node 2 3-full_sft.py
@@ -321,21 +382,9 @@ CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
321
  ```bash
322
  python 5-dpo_train.py
323
  ```
324
-
325
- ---
326
- 🔗训练完成的模型权重:
327
-
328
- | Model Name | params | Config | pretrain_model | single_sft_model | multi_sft_model |
329
- |------------------|--------|-------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|
330
- | minimind-small-T | 26M | d_model=512<br/>n_layers=8 | - | [链接](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [链接](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) |
331
- | minimind-small | 56M | d_model=640<br/>n_layers=8 | [链接](https://pan.baidu.com/s/1nJuOpnu5115FDuz6Ewbeqg?pwd=6666) | [链接](https://pan.baidu.com/s/1lRX0IcpjNFSySioeCfifRQ?pwd=6666) | [链接](https://pan.baidu.com/s/1LzVxBpL0phtGUH267Undqw?pwd=6666) |
332
- | minimind | 218M | d_model=1024<br/>n_layers=16 | [链接](https://pan.baidu.com/s/1jzA7uLEi-Jen2fW5olCmEg?pwd=6666) | [链接](https://pan.baidu.com/s/1Hvt0Q_UB_uW2sWTw6w1zRQ?pwd=6666) | [链接](https://pan.baidu.com/s/1fau9eat3lXilnrG3XNhG5Q?pwd=6666) |
333
- | minimind-MoE | 166M | d_model=1024<br/>n_layers=8<br/>share+route=2+4 | [链接](https://pan.baidu.com/s/11CneDVTkw2Y6lNilQX5bWw?pwd=6666) | [链接](https://pan.baidu.com/s/1fRq4MHZec3z-oLK6sCzj_A?pwd=6666) | [链接](https://pan.baidu.com/s/1HC2KSM_-RHRtgv7ZDkKI9Q?pwd=6666) |
334
- | minimind-V1 | 108M | d_model=768<br/>n_layers=16 | - | [链接](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666) | [链接](https://pan.baidu.com/s/12iHGpAs6R0kqsOnGtgK6vQ?pwd=6666) |
335
-
336
  ---
337
 
338
- 关于LLM的参数配置,有一篇很有意思的论文[MobileLLM](https://arxiv.org/pdf/2402.14905)做了详细的研究和实验。
339
  scaling law在小模型中有自己独特的规律。
340
  引起Transformer参数成规模变化的参数几乎只取决于`d_model`和`n_layers`。
341
 
@@ -348,176 +397,151 @@ MobileLLM提出架构的深度比宽度更重要,「深而窄」的「瘦长
348
  例如当模型参数固定在125M或者350M时,30~42层的「狭长」模型明显比12层左右的「矮胖」模型有更优越的性能,
349
  在常识推理、问答、阅读理解等8个基准测试上都有类似的趋势。
350
  这其实是非常有趣的发现,因为以往为100M左右量级的小模型设计架构时,几乎没人尝试过叠加超过12层。
351
-
352
  这与MiniMind在训练过程中,模型参数量在`d_model`和`n_layers`之间进行调整实验观察到的效果是一致的。
353
  然而「深而窄」的「窄」也是有维度极限的,当d_model<512时,词嵌入维度坍塌的劣势非常明显,
354
  增加的layers并不能弥补词嵌入在固定q_head带来d_head不足的劣势。
355
  当d_model>1536时,layers的增加似乎比d_model的优先级更���,更能带来具有“性价比”的参数->效果增益。
356
- 因此MiniMind设定small模型的d_model=640,n_layers=8来获取的「极小体积<->更好效果」的平衡。
357
- 设定d_model=1024,n_layers=16来获取效果的更大收益,更加符合小模型scaling-law的变化曲线。
358
 
359
- # 📌 Eval
360
 
361
- > 【注】以下测试于2024.8.28完成,此日期后发布的(例如MiniMind-V1)新模型,无特殊需要时将不加入测试。
 
 
 
 
 
 
 
 
 
 
 
 
 
 
362
 
363
- [A] [minimind-small-T(0.02B)](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666)<br/>
364
- [B] [minimind-small(0.05B)](https://pan.baidu.com/s/1lRX0IcpjNFSySioeCfifRQ?pwd=6666)<br/>
365
- [C] [minimind-MoE(0.16B)](https://pan.baidu.com/s/1fRq4MHZec3z-oLK6sCzj_A?pwd=6666)<br/>
366
- [D] [minimind(0.2B)](https://pan.baidu.com/s/1Hvt0Q_UB_uW2sWTw6w1zRQ?pwd=6666)<br/>
367
- [E] [baby-llama2-chinese(0.2B)](https://github.com/DLLXW/baby-llama2-chinese)<br/>
368
- [F] [chatlm-mini-chinese(0.2B)](https://github.com/charent/ChatLM-mini-Chinese)<br/>
369
 
370
- ```angular2html
371
- [Q]: 一星期有多少天?
372
- [A]: 一星期有七天。
373
- [B]: 一星期有7天。
374
- [C]: 根据您的计算,一星期有12个小时。因此,我们需要将小时数乘以12。这样,我们就可以得到12的小时数。
375
- [D]: 一星期共有七天。
376
- [E]: 作为一个人工智能助手,我无法确定您所描述的星期数。 如果您能够提供您所描述的星期数,我可以为您提供有关该星期天数的信息。
377
- [F]: 一星期有七天,也就是七天的积,也就是 7 x 7 = 24 天。这个数字是通过将 24 除以 7 得到的,因为一周有七天,所以一星期有 24 x 7 = 24 天。
378
 
 
 
 
 
 
379
 
 
380
  [Q]: 你知道长江吗?
381
  [A]: 是的,长江是中国最长的河流,全长超过3300千米。
382
  [B]: 是的,长江是中国最长的河流,全长约7100公里。它的流域面积约为3328万平方公里,是中国最大的河流之一,也是世界第三大河流,总长度约为6300千米。
383
- [C]: 我知道。长江是我国的一个重要水系,有着悠久的历史和文化。长江是中华民族的象征和精神支柱,也是世界上最美丽的自然景观之一。长江沿岸地区的人们,特别是在长江流域生活了数千年。长江在中国的历史和文化中,尤其是历史悠久的文化,有着深厚的历史和文化底蕴。
384
- [D]: 长江是中国的第一大河流,全长约6380公里。它是亚洲最长的河流,也是中国第一大河流。长江流域是中国的重要河流之一,有着丰富的自然资源和文化遗产。长江流域是中国最重要的农业和经济发展区,拥有众多的文化遗产和历史景点,是中国最重要的文化旅游目的地之一。
385
- [E]: 同时也知道,中国是世界上最长的城市之一,拥有超过3400万人口。
386
- [F]: 是的,长江是中国最长的河流,也是世界上最长的河流之一。它发源于青藏高原,流经九个省区,最终注入东海。长江流域的气候温和,四季分明,是中国重要的农业和工业基地。长江流域的水产资源丰富,是中国的重���渔业资源之一。
387
 
388
 
389
  [Q]: 第一颗人造卫星是哪个国家发射的?
390
  [A]: 第一颗人造卫星是斯普特尼克1号。
391
  [B]: 第一颗人造卫星是美国发射的。
392
- [C]: 第一颗人造卫星是中国国家航天科技集团有限公司的卫星,它是中国航天科技集团公司的卫星,由中国航天科技集团公司发射。
393
- [D]: 第一颗人造卫星是苏联于1957年发射的。
394
- [E]: 这颗名为“阿姆斯特朗”的卫星,于19611120日发射,是第一个成功将人类送入太空的卫星。
395
- [F]: 第一颗人造卫星是苏联于1957年4月4日发射的,它是由苏联宇航员尤里·加加林(Yuri Gagarin)和苏联工程师罗尔德·谢尔盖耶夫(Roald Skohley)共同发射的。在发射过程中,加加林和谢尔盖耶夫使用了一个名为“斯普特尼克1号”的小型卫星,它的直径为4.3米,重量为1,051公斤。这枚卫星被称为“斯普特尼克1号”,是苏联太空竞赛中的一个里程碑,也是人类历史上第一次在太空中进行太空行走。
396
 
397
 
398
  [Q]: 你知道泰山在哪里吗?
399
  [A]: 泰山位于中国东部地区,是中国五岳之一。
400
  [B]: 泰山位于山东省泰安市,是中国五岳之一。
401
- [C]: 泰山在哪里呢?
402
- [D]: 泰山位于中国山东省昆山市,是中国古代文化的瑰宝之一,也是世界著名的山岳风景区之一。泰山是五岳之一,以其险峻的山峰、秀丽的自然风光和丰富的文化底蕴而闻名于世。
403
- [E]: 作为一个人工智能助手,我无法访问互联网,因此无法提供关于你所在位置的信息。不过,你可以通过搜索或搜索相关关键词来获取有关你所在位置的信息,或者查看相关新闻或网站来获取有关你所在位置的信息。
404
- [F]: 泰山位于中国山东省泰安市境内,是中国五岳之一,也是中国文化和历史的重要象征之一。它是中国五岳之一,也是中国道教和佛教的发源地之一。泰山的海拔高度为1,545米,是中国五岳之一,也是中国最著名的旅游景点之一。它的历史可以追溯到公元前5世纪,是中国古代帝王祭祀的场所。泰山也是中国文化的重要组成部分,被誉为“五岳之首”,是中国文化的重要遗产之一。
405
 
406
 
407
  [Q]: 世界上最高的山峰是什么?
408
  [A]: 世界上最高的山峰是珠穆朗玛峰,位于尼泊尔和中国的边境上,海拔8848米。
409
  [B]: 珠穆朗玛峰是世界上最高的山峰,位于喜马拉雅山脉,海拔8848米。
410
- [C]: 世界上最高的山峰是珠穆朗玛峰。
411
- [D]: 世界上最高的山峰是珠穆朗玛峰,位于喜马拉雅山脉,海拔8,848.86米。
412
- [E]: 世界上最高的山峰是���穆朗玛峰,位于喜马拉雅山脉,海拔8,848.86米(29,031.69英尺)。
413
- [F]: 世界上最高的山峰是珠穆朗玛峰,也称为萨加玛塔,位于喜马拉雅山脉的尼泊尔和中国边境之间,海拔8,848.86米(29,031.69英尺)。它是世界上最著名的登山目标之一,吸引着来自世界各地的登山者前来挑战。珠穆朗玛峰的名字来自藏语,意思是“女神的额峰”,它是由两个不同的山峰组成的,其中一个是位于尼泊尔的珠穆朗玛峰,另一个则位于中国。它的高度是通过测量海拔高度来确定的,因为它是通过测量地球的平均高度来确定的。尽管珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何其他地方都能到达的。地球上的其他地方,如南美洲的安第斯山脉,也是世界上最高的山脉,但它并不是地球上最高的山峰。总之,珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何地方都能到达的。它是地球上最高的山脉,但它并不是地球上任何地方都能到达的。
414
 
415
 
416
  [Q]: 万有引力是谁提出的?
417
  [A]: 万有引力是由英国物理学家牛顿提出的。他认为,所有物体都会产生引力,这个引力是由物体的质量和距离决定的。这个理论揭示了质量和距离之间的关系,并证明了万有引力的本质。
418
  [B]: 万有引力是由牛顿发现的,它是由英国物理学家詹姆斯·瓦特于1879年提出的。
419
- [C]: 万有引力是谁提出的?
420
- [D]: 万有引力是由牛顿提出的,他发现每个物体都会对其他物体施加引力,这些力的大小取决于物体的质量和距离。
421
- [E]: 引力是宇宙中一种力量,它是由质量或能量所组成的。根据牛顿的万有引力定律,一个物体之间的引力大小与它的质量成正比,与它的质量成反比。因此,牛顿的万有引力定律表明,一个物体之间的引力大小与它的质量成正比,与它的质量成反比。这个定律在物理学、天文学和物理学中都有广泛的应用。
422
- [F]: 万有引力是由英国物理学家艾萨克·牛顿于1687年提出的。他在1687年发表了一篇著名的论文,题为《自然哲学的数学原理》,其中提出了万有引力定律,指出任何两个物体之间都存在着一种引力,这种引力的大小与它们的质量成正比,与它们之间的距离的平方成反比。这个定律被称为万有引力定律,它描述了质量和距离之间的关系。它表明,两个物体之间的引力与它们的质量成正比,与它们之间的距离的平方成反比。这个定律对于解释宇宙中的许多现象和现象之间的相互作用非常重要,例如行星绕太阳运动、恒星之间的引力透镜效应等等。它也被广泛应用于天文学、物理学和工程学等领域,为我们理解宇宙的本质和演化提供了重要的理论基础。
423
 
424
 
425
  [Q]: 你知道大熊猫的主要食物是什么吗?
426
  [A]: 大熊猫的主要食物是竹子。它们通常吃竹子,每天需要吃掉大约30公斤的竹子。
427
  [B]: 大熊猫主要是竹子,它们的主要食物是竹子。大熊猫的消化系统比较特殊,它们每天需要消化大量的竹子才能维持其生存需要。除了竹子外,大熊猫还在野外生存。
428
- [C]: 大熊猫的主要食物是竹子、蘑菇、蘑菇、甜菜和蘑菇。
429
- [D]: 大熊猫是一种以竹子为食的哺乳动物,但主要食物来源是竹笋、竹叶、茎、茎等。
430
- [E]: 大熊猫的主要食物是竹子。
431
- [F]: 大熊猫是一种以竹子为主要食物的植食动物,它们的肠道结构适合消化纤维素含量高的竹子,但也会吃一些其他植物,如草本植物和小型哺乳动物。大熊猫的肠道结构也适应了高纤维的竹子,这使得它们能够消化大量的纤维素,从而帮助它们在野外生存。大熊猫的肠道中还含有一种叫做“纤维素酶”的酶类物质,这种酶能够帮助大熊猫分解纤维素,从而帮助它们消化大量的水分和营养物质。此外,大熊猫的肠道还含有一种叫做“纤维素蛋白”的蛋白质,这种蛋白质能够帮助大熊猫维持体内的能量水平,从而帮助它们在野外生存。总的来说,大熊猫的主要食物是竹子,但它们也会吃一些其他植物,如草本植物和小型哺乳动物。
432
 
433
  [Q]: 海水为什么是咸的?
434
  [A]: 海水中含有许多盐分,这些盐分可以导致海水咸度的增加。海水中的盐分主要来自于海水中的盐分,这些盐分可以从海水中的盐分中获得。海水中的氧化物和盐分可以与海水中的盐分形成氢键,从而使海水咸度增加。
435
  [B]: 海水中含有丰富的钙、磷、钾和磷等元素,这些元素在海洋中的数量和分布非常有限。此外,海水中还含有各种生物和微生物的营养物质,如贝壳和海藻等。
436
- [C]: 咸的是海水,因为海水的溶解度高于海水。海水的碳酸盐和盐基度也高于海水,这是因为海水的碳酸盐和盐基度都非常高,海水的碳酸盐和盐基度也非常高。
437
- [D]: 海水中含有溶解盐类和其他矿物质,这些物质来自于地壳的侵蚀和吸附作用,而这些物质会溶解在水中,导致海水呈现咸味。
438
- [E]: 因为水是许多生物的生存所必需的,它能够吸收和释放二氧化碳,释放氧气,维持大气中的水蒸气,以及维持水循环。水中的盐类和化学物质会释放出氧气,使地球上的生物呼吸。此外,水还具有许多其他功能,如保持水循环、维持水循环、调节气候和环境等。水是许多生物生存所必需的,它也是许多自然现象的重要基础。
439
- [F]: 海水是咸的主要原因是因为其中含有大量的溶解盐类,其中最主要的是氯化钠(NaCl)。氯化钠是一种非常常见的盐类,它存在于地球上的大部分地区,包括海洋、湖泊、河流和地下水。当水从海洋中蒸发时,它会带走一部分盐分,使海水变得咸。此外,海中还含有其他溶解盐类的物质,如硫酸盐、硫酸镁、硫酸钙、硫酸镁盐、硫酸镁钾、硫酸镁钠、硫酸镁钙、硫酸镁镁钙、硫酸镁镁钙、硫酸镁镁钙、硫酸镁镁钠、硫酸镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁镁钙、硫酸镁镁镁镁
440
  ```
441
 
442
- 🙋‍♂️直接把上述模型的回答丢给GPT-4o,让它帮忙打个分:
 
443
 
444
  ---
445
- 根据你的要求,我将根据准确性、清晰性和完整性来评价每个模型的表现,并给出评分和排序。
446
-
447
- ### 模型评分和表现总结:
448
-
449
- **A模型**
450
-
451
- - **准确性**:大部分回答准确,但偶尔会有轻微错误。
452
- - **清晰性**:回答简洁明了,语言流畅。
453
- - **完整性**:回答内容有时略显简单,但整体信息足够。
454
- - **评分**:80分
455
-
456
- **B模型**
457
-
458
- - **准确性**:多数回答准确,但有部分小错误(如第一颗人造卫星的问题)。
459
- - **清晰性**:语言较为清晰,但有时表达稍显混乱。
460
- - **完整性**:回答内容较全面,但存在信息误差。
461
- - **评分**:75分
462
-
463
- **C模型**
464
-
465
- - **准确性**:回答内容不准确,且多次出现自问自答的情况。
466
- - **清晰性**:语言流畅,但回答内容的逻辑性差。
467
- - **完整性**:信息不完整,有时缺乏重要细节。
468
- - **评分**:55分
469
-
470
- **D模型**
471
 
472
- - **准确性**:大多数回答准确,基本符合事实。
473
- - **清晰性**:表达清晰,信息量适中。
474
- - **完整性**:回答较为完整,但有些答案可能包含不必要的细节。
475
- - **评分**:85分
476
 
477
- **E模型**
 
 
478
 
479
- - **准确性**:准确度较低,部分回答甚至与问题无关。
480
- - **清晰性**:表达不够清晰,容易引起混淆。
481
- - **完整性**:信息不完整,且有时偏离主题。
482
- - **评分**:50分
483
 
484
- **F模型**
 
 
485
 
486
- - **准确性**:部分回答不准确,且有明显错误(如“24天”)。
487
- - **清晰性**:表达冗长,容易造成混淆。
488
- - **完整性**:信息过度冗长,且有重复内容,降低了答案的可读性。
489
- - **评分**:60分
490
 
491
- ### 排序(从高到低):
 
 
492
 
493
- 1. **D模型** - 85分
494
- 2. **A模型** - 80分
495
- 3. **B模型** - 75分
496
- 4. **F模型** - 60分
497
- 5. **C模型** - 55分
498
- 6. **E模型** - 50分
499
 
500
- 这些评分和排序基于每个模型在准确性、清晰性和完整性三个方面的综合表现。
 
 
501
 
502
  ---
503
 
504
  ## 👉效果总结
505
 
506
- * minimind系列(ABCD)的排序符合直觉,minimind(0.2B)评分最高,常识性问题的回答基本没有错误和幻觉。
507
- * 出乎意料的是,minimind-small-T(0.02B)仅有26M参数,却可以接近minimind(0.2B)的表现。
508
- * minimind(0.2B)的sft轮数`epochs`仅有不到2,因为训练时间是0.02B的好几倍,所以偷懒提前kill腾出资源给小模型,0.2B没有得到充分训练的情况下依然做到了最强,其实还是底大一级压死人。
509
- * minimind-MoE(0.16B)表现很差,甚至不如它同配置的dense模型minimind(0.05B)
510
- ,其实这并非MoE的锅。同样是因为偷懒提前kill腾出资源给小模型,但是MoE模型多专家模式需要的训练轮次本来就需要酌情更高,在epochs设置为2时训练的极其不充分。minimind不久前实验阶段在Yi
511
- tokenizer上试验过MoE的充分训练版本,可以做到比dense表现肉眼可见的好。现在先这样了hh,日后腾出服务器再训练更新v2 v3版本。
512
- *
513
 
514
- F模型的回答看起来是这里最完美的,尽管存在些许幻觉瞎编的情况。但GPT-4o和kimi的评分都一致认为它“信息过度冗长,且有重复内容,存在幻觉”。其实这种评价太严格了,100个字中有10个字是幻觉,就很容易把它归到0分。由于F模型训练文本默认长度更长,数据集大得多,所以回答的看起来很完备,在体积近似的情况下,数据比模型更重要得多。
515
 
516
- > 🙋‍♂️个人主观评价:F>D>A≈B>C>E
 
517
 
518
- > 🤖GPT-4o评价:D>A>B>F>C>E
519
 
520
- 总而言之scaling law:模型参数越大,训练数据越多模型的性能越强。
 
 
521
 
522
  # 📌 Objective dataset: C-Eval
523
 
@@ -528,60 +552,14 @@ minimind模型本身没有使用较大的数据集训练,也没有针对回答
528
 
529
  > 例如minimind-small的结果细项:
530
 
531
- | 类别 | 正确数量/总题数 | 正确率 |
532
- |----------------------------------------------|----------|--------|
533
- | probability_and_statistics_val | 3/18 | 16.67% |
534
- | law_val | 5/24 | 20.83% |
535
- | middle_school_biology_val | 4/21 | 19.05% |
536
- | high_school_chemistry_val | 7/19 | 36.84% |
537
- | high_school_physics_val | 5/19 | 26.32% |
538
- | legal_professional_val | 2/23 | 8.70% |
539
- | high_school_chinese_val | 4/19 | 21.05% |
540
- | high_school_history_val | 6/20 | 30.00% |
541
- | tax_accountant_val | 10/49 | 20.41% |
542
- | modern_chinese_history_val | 4/23 | 17.39% |
543
- | middle_school_physics_val | 4/19 | 21.05% |
544
- | middle_school_history_val | 4/22 | 18.18% |
545
- | basic_medicine_val | 1/19 | 5.26% |
546
- | operating_system_val | 3/19 | 15.79% |
547
- | logic_val | 4/22 | 18.18% |
548
- | electrical_engineer_val | 7/37 | 18.92% |
549
- | civil_servant_val | 11/47 | 23.40% |
550
- | chinese_language_and_literature_val | 5/23 | 21.74% |
551
- | college_programming_val | 10/37 | 27.03% |
552
- | accountant_val | 9/49 | 18.37% |
553
- | plant_protection_val | 7/22 | 31.82% |
554
- | middle_school_chemistry_val | 4/20 | 20.00% |
555
- | metrology_engineer_val | 3/24 | 12.50% |
556
- | veterinary_medicine_val | 6/23 | 26.09% |
557
- | marxism_val | 5/19 | 26.32% |
558
- | advanced_mathematics_val | 5/19 | 26.32% |
559
- | high_school_mathematics_val | 4/18 | 22.22% |
560
- | business_administration_val | 8/33 | 24.24% |
561
- | mao_zedong_thought_val | 8/24 | 33.33% |
562
- | ideological_and_moral_cultivation_val | 5/19 | 26.32% |
563
- | college_economics_val | 17/55 | 30.91% |
564
- | professional_tour_guide_val | 10/29 | 34.48% |
565
- | environmental_impact_assessment_engineer_val | 7/31 | 22.58% |
566
- | computer_architecture_val | 6/21 | 28.57% |
567
- | urban_and_rural_planner_val | 11/46 | 23.91% |
568
- | college_physics_val | 5/19 | 26.32% |
569
- | middle_school_mathematics_val | 3/19 | 15.79% |
570
- | high_school_politics_val | 4/19 | 21.05% |
571
- | physician_val | 13/49 | 26.53% |
572
- | college_chemistry_val | 3/24 | 12.50% |
573
- | high_school_biology_val | 5/19 | 26.32% |
574
- | high_school_geography_val | 4/19 | 21.05% |
575
- | middle_school_politics_val | 6/21 | 28.57% |
576
- | clinical_medicine_val | 6/22 | 27.27% |
577
- | computer_network_val | 2/19 | 10.53% |
578
- | sports_science_val | 2/19 | 10.53% |
579
- | art_studies_val | 14/33 | 42.42% |
580
- | teacher_qualification_val | 12/44 | 27.27% |
581
- | discrete_mathematics_val | 6/16 | 37.50% |
582
- | education_science_val | 7/29 | 24.14% |
583
- | fire_engineer_val | 9/31 | 29.03% |
584
- | middle_school_geography_val | 1/12 | 8.33% |
585
 
586
  ```text
587
  总题数: 1346
@@ -593,12 +571,10 @@ minimind模型本身没有使用较大的数据集训练,也没有针对回答
593
 
594
  #### 结果汇总:
595
 
596
- | category | correct | question_count | accuracy |
597
- |:-----------------|:--------:|:--------------:|:---------:|
598
- | minimind-small-T | 344 | 1346 | 25.56% |
599
- | minimind-small | 312 | 1346 | 23.18% |
600
- | minimind | 351 | 1346 | 26.08% |
601
- | minimind-moe | 316 | 1346 | 23.48% |
602
 
603
  #### 以下来自GPT-4o对minimind表现的瞎猜:
604
 
@@ -629,16 +605,16 @@ minimind模型本身没有使用较大的数据集训练,也没有针对回答
629
  ### 推理与导出
630
 
631
  * [./export_model.py](./export_model.py)可以导出模型到transformers格式,推送到huggingface
632
- *
633
 
634
- MiniMind的huggingface集合地址:[MiniMind](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
 
635
 
636
  ---
637
 
638
  ### API推理
639
 
640
- [./my_openai_api.py](./my_openai_api.py)完成了openai_api的聊天接口,方便将自己的模型接入第三方UI
641
- 例如fastgpt、OpenWebUI等
642
 
643
  * 从[Huggingface](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)下载模型权重文件
644
  ```
@@ -685,26 +661,56 @@ MiniMind的huggingface集合地址:[MiniMind](https://huggingface.co/collectio
685
 
686
  # 📌 Acknowledge
687
 
688
- 如果你觉得本项目对你有所帮助,欢迎Star🎉✨。
 
 
689
 
690
- 篇幅不短水平有限难免纰漏,欢迎提issue交流或批评指正。
691
 
692
- 感谢以下开源项目的提供的灵感和开源数据集
 
 
 
 
693
 
694
- * [baby-llama2-chinese](https://github.com/DLLXW/baby-llama2-chinese)
695
- * [ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese)
696
- * [Zero-Chatgpt](https://github.com/AI-Study-Han/Zero-Chatgpt/tree/main)
 
 
 
697
 
698
- ## ✨Top contributors
699
 
700
- <a href="https://github.com/jingyaogong/minimind/graphs/contributors">
701
- <img src="https://contrib.rocks/image?repo=jingyaogong/minimind" />
 
 
 
 
 
 
 
 
 
702
  </a>
703
 
704
- # 📌 Statement
 
 
 
 
 
 
 
 
 
 
 
 
 
 
705
 
706
- 本项目不承担开源模型和代码导致的数据安全、舆情风险或发生任何模型被误导、滥用、传播、不当利用而产生的风险和责任。
707
 
708
- ## License
709
 
710
- This repository is licensed under the [Apache-2.0 License](LICENSE).
 
1
+ <div align="center">
2
+
3
  ![logo](./images/logo.png)
4
+
5
+ </div>
6
+
7
  <div align="center">
8
 
9
  ![visitors](https://visitor-badge.laobi.icu/badge?page_id=jingyaogong/minimind)
 
32
 
33
  ---
34
 
35
+ <div align="center">
36
+
37
+ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
38
+
39
+ [Bilibili视频链接](https://www.bilibili.com/video/BV12dHPeqE72/?share_source=copy_web&vd_source=670c2504f88726f8cf4a21ef6147c0e8)
40
+
41
+ </div>
42
+
43
  # 📌 Introduction
44
 
45
  大语言模型(LLM)领域,如 GPT、LLaMA、GLM 等,虽然它们效果惊艳,
 
52
  因此,本项目的目标是把上手LLM的门槛无限降低,
53
  直接从0开始训练一个极其轻量的语言模型。
54
 
55
+ > [!TIP]
56
+ > (截至2024-9-17)minimind训练了3个型号模型,最小仅需26M(0.02B),即可具备流畅的对话能力!
57
 
58
+ | 模型 (大小) | tokenizer长度 | 推理占用 | release | 主观评分(/100) |
59
+ |-------------------------|-------------|--------|------------|------------|
60
+ | minimind-v1-small (26M) | 6400 | 0.5 GB | 2024.08.28 | 50' |
61
+ | minimind-v1-moe (4×26M) | 6400 | 1.0 GB | 2024.09.17 | 55' |
62
+ | minimind-v1 (108M) | 6400 | 1.0 GB | 2024.09.01 | 60' |
 
 
63
 
64
  > 该分析在一个带有Torch 2.1.2、CUDA 12.2和Flash Attention 2的RTX 3090 GPU上运行。
65
 
 
69
 
70
  - 公开MiniMind模型代码(包含Dense和MoE模型)、Pretrain、SFT指令微调、LoRA微调、DPO偏好优化的全过程代码、数据集和来源。
71
  - 兼容`transformers`、`accelerate`、`trl`、`peft`等流行框架。
72
+ - 训练支持单机单卡、单机多卡(DDP、DeepSpeed)训练。训练过程中支持在任意位置停止,及在任意位置继续训练。
73
  - 在Ceval数据集上进行模型测试的代码。
74
  - 实现Openai-Api基本的chat接口,便于集成到第三方ChatUI使用(FastGPT、Open-WebUI等)。
75
 
76
  希望此开源项目可以帮助LLM初学者快速入门!
77
 
78
+ ### 👉**最近更新**
79
+
80
+ <details close>
81
+ <summary> <b>2024-09-17 (new🎉)</b> </summary>
82
+
83
+ - 更新minimind-v1-moe模型
84
+
85
+ - 为了防止歧义,不再使用mistral_tokenizer分词,全部采用自定义的minimind_tokenizer作为分词器。
86
+
87
+ </details>
88
+
89
+ <details close>
90
+ <summary> <b>2024-09-01</b> </summary>
91
+
92
+ - 更新minimind-v1 (108M)模型,采用minimind_tokenizer,预训练轮次3 + SFT轮次10,更充分训练,性能更强。
93
+
94
+ - 项目已部署至ModelScope创空间,可以在此网站上体验:
95
+
96
+ - [ModelScope在线体验](https://www.modelscope.cn/studios/gongjy/minimind)
97
+
98
+ </details>
99
 
100
  <details close>
 
 
101
  <summary> <b>2024-08-27</b> </summary>
102
+
103
+ - 项目首次开源
104
+
105
  </details>
106
 
107
  # 📌 Environment
 
114
  * CUDA == 12.2
115
  * [requirements.txt](./requirements.txt)
116
 
117
+ # 📌 Quick Inference & Test
118
 
119
  <div align="center" style="font-size: 1.5em; font-weight: bold;">
120
  <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
121
  Hugging Face
122
 
123
+ [MiniMind (HuggingFace)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
124
+
125
+ <img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
126
+
127
+ [MiniMind (ModelScope)](https://www.modelscope.cn/models/gongjy/minimind-v1)
128
+
129
  </div>
130
 
131
  ```bash
132
  # step 1
133
+ git clone https://huggingface.co/jingyaogong/minimind-v1
134
  ```
135
 
136
  ```bash
 
138
  python 2-eval.py
139
  ```
140
 
141
+ 或者启动streamlit,启动网页聊天界面
142
+
143
+ ```bash
144
+ # or step 3, use streamlit
145
+ streamlit run fast_inference.py
146
+ ```
147
+
148
+ ![](./images/streamlit.png)
149
+
150
+ <div align="center">
151
+
152
+ 项目已部署至ModelScope创空间,可以在此网站上体验:
153
+
154
+ [ModelScope在线体验](https://www.modelscope.cn/studios/gongjy/minimind)
155
+
156
+
157
+ </div>
158
+
159
  # 📌 Quick Start
160
 
161
+ * 0、环境安装
162
+ ```bash
163
+ pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
164
+ ```
165
  * 1、克隆项目代码
166
  ```text
167
  git clone https://github.com/jingyaogong/minimind.git
 
170
 
171
  * 2.1 下载[数据集下载地址](#数据集下载地址)放到`./dataset`目录下
172
 
173
+ * 2.2 `python data_process.py`处理数据集,例如pretrain数据提前进行token-encoder、sft数据集抽离qa到csv文件
174
 
175
+ * 2.3 在`./model/LMConfig.py` 中调整model的参数配置
176
+ * 2.4 `python 1-pretrain.py` 执行预训练
177
+ * 2.5 `python 3-full_sft.py` 执行指令微调
178
+ * 2.6 `python 4-lora_sft.py` 执行lora微调(非必须)
179
+ * 2.7 `python 5-dpo_train.py` 执行DPO人类偏好强化学习对齐(非必须)
180
  * 3、测试模型推理效果
181
+ * 确保需要使用的,训练完成的参数权重位于`./out/`目录下
182
+ * 也可以直接去[训练完成的模型权重](#训练完成的模型权重)下载使用我训练好的
183
  ```text
184
  out
185
  ├── multi_chat
 
186
  │   ├── full_sft_512.pth
187
+ │   ├── full_sft_512_moe.pth
188
+ │   └── full_sft_768.pth
189
  ├── single_chat
 
190
  │   ├── full_sft_512.pth
191
+ │   ├── full_sft_512_moe.pth
192
+ │   └── full_sft_768.pth
193
+ ├── pretrain_768.pth
194
+ ├── pretrain_512_moe.pth
195
+ ├── pretrain_512.pth
 
 
 
 
196
  ```
197
  * `python 0-eval_pretrain.py`测试预训练模型的接龙效果
198
  * `python 2-eval.py`测试模型的对话效果
199
  ![2-eval](./images/2-eval.png)
200
 
201
+ 🍭 【Tip】预训练和全参微调pretrain和full_sft均支持多卡加速
 
 
202
 
203
+ * 单机N卡启动训练(DDP)
204
+ ```bash
205
  torchrun --nproc_per_node N 1-pretrain.py
206
+ # and
 
 
207
  torchrun --nproc_per_node N 3-full_sft.py
208
  ```
209
+ * 单机N卡启动训练(DeepSpeed)
210
+ ```bash
211
+ deepspeed --master_port 29500 --num_gpus=N 1-pretrain.py
212
+ # and
213
+ deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
214
+ ```
215
 
216
  # 📌 Data sources
217
 
 
223
  因为LLM体积非常小,为了避免模型头重脚轻(词嵌入embedding层参数占整个LLM比太高),所以词表长度需要选择比较小。
224
  强大的开源模型例如01万物、千问、chatglm、mistral、Llama3等,它们的tokenizer词表长度如下:
225
 
226
+ <table>
227
+ <tr><th>Tokenizer模型</th><th>词表大小</th><th>来源</th></tr>
228
+ <tr><td>yi tokenizer</td><td>64,000</td><td>01万物(中国)</td></tr>
229
+ <tr><td>qwen2 tokenizer</td><td>151,643</td><td>阿里云(中国)</td></tr>
230
+ <tr><td>glm tokenizer</td><td>151,329</td><td>智谱AI(中国)</td></tr>
231
+ <tr><td>mistral tokenizer</td><td>32,000</td><td>Mistral AI(法国)</td></tr>
232
+ <tr><td>llama3 tokenizer</td><td>128,000</td><td>Meta(美国)</td></tr>
233
+ <tr><td>minimind tokenizer</td><td>6,400</td><td>自定义</td></tr>
234
+ </table>
235
 
236
+ > [!TIP]
237
+ > 2024-09-17更新:为了防止过去的版本歧义&控制体积,minimind所有模型均使用minimind_tokenizer分词,废弃所有mistral_tokenizer版本。
 
238
 
239
+ > 尽管minimind_tokenizer长度很小,编解码效率弱于qwen2、glm等中文友好型分词器。
240
+ > 但minimind模型选择了自己训练的minimind_tokenizer作为分词器,以保持整体参数轻量,避免编码层和计算层占比失衡,头重脚轻,因为minimind的词表大小只有6400。
241
+ > 且minimind在实际测试中没有出现过生僻词汇解码失败的情况,效果良好。
242
+ > 由于自定义词表压缩长度到6400,使得LLM总参数量最低只有26M。
243
 
244
  ---
245
 
246
+ - 📙【Pretrain数据】:
247
+ [Seq-Monkey通用文本数据集](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Seq-Monkey百度网盘](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
248
+ 是由多种公开来源的数据(如网页、百科、博客、开源代码、书籍等)汇总清洗而成。整理成统一的JSONL格式,并经过了严格的筛选和去重,确保数据的全面性、规模、可信性和高质量。总量大约在10B
249
+ token,适合中文大语言模型的预训练。
250
 
251
+ > 第2种选择:[SkyPile-150B数据集](https://hf-mirror.com/datasets/Skywork/SkyPile-150B/tree/main/data)
252
+ 的可公开访问部分包含约2.33亿个独立网页,每个网页平均包含1000多个汉字。数据集包括大约1500亿个令牌和620GB的纯文本数据。
253
+ **如果着急的话**,可以尝试只挑选SkyPile-150B的部分jsonl下载(并在./data_process.py中对文本tokenizer生成*
254
+ .bin文件),以便快速跑通预训练流程。
255
 
256
  ---
257
 
 
285
 
286
  ### 数据集下载地址
287
 
288
+ 下载到`./dataset/`目录下
289
+
290
+ | MiniMind训练数据集 | 下载地址 |
291
+ |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
292
+ | **【tokenizer训练集】** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [百度网盘](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
293
+ | **【Pretrain数据】** | [Seq-Monkey官方](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [百度网盘](https://pan.baidu.com/s/1-Z8Q37lJD4tOKhyBs1D_6Q?pwd=6666) / [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) |
294
+ | **【SFT数据】** | [匠数大模型SFT数据集](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
295
+ | **【DPO数据1】** | [活字数据集1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
296
+ | **【DPO数据2】** | [活字数据集2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |
297
 
298
  # 📌 Model
299
 
 
317
  ![](./images/LLM-structure.png)
318
  ![](./images/LLM-structure-moe.png)
319
 
320
+ 修改模型配置见[./model/LMConfig.py](./model/LMConfig.py)
321
+ minimind目前训练的模型版本见下表:
322
 
323
  | Model Name | params | len_vocab | n_layers | d_model | kv_heads | q_heads | share+route | TopK |
324
  |------------------|--------|-----------|----------|---------|----------|---------|-------------|------|
325
+ | minimind-v1-small | 26M | 6400 | 8 | 512 | 8 | 16 | - | - |
326
+ | minimind-v1-moe | 4×26M | 6400 | 8 | 512 | 8 | 16 | 2+4 | 2 |
327
+ | minimind-v1 | 108M | 6400 | 16 | 768 | 8 | 16 | - | - |
 
 
328
 
 
 
329
 
330
  # 📌 Experiment
331
 
 
336
  环境:python 3.9 + Torch 2.1.2 + DDP多卡训练
337
  ```
338
 
339
+ | Model Name | params | len_vocab | batch_size | pretrain_time | sft_single_time | sft_multi_time |
340
+ |------------------|--------|-----------|------------|-------------------|-------------------|---------------------|
341
+ | minimind-v1-small | 26M | 6400 | 64 | ≈2 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
342
+ | minimind-v1-moe | 4×26M | 6400 | 40 | ≈6 hour (1 epoch) | ≈5 hour (1 epoch) | ≈1 hour (1 epoch) |
343
+ | minimind-v1 | 108M | 6400 | 16 | ≈6 hour (1 epoch) | ≈4 hour (1 epoch) | ≈1 hour (1 epoch) |
 
 
344
 
345
  ---
346
 
 
348
  - LLM首先要学习的并非直接与人交流,而是让肚子中充满知识的墨水,至于墨水理论上喝的越饱越好,产生大量的对世界的认知积累。
349
  - 预训练就是让Model先埋头苦学大量基本的知识,例如从维基百科、新闻、常识、书籍等。
350
  - 它无监督的从大量的文本数据中压缩知识到自己模型的权重,目的是:学会词语接龙。例如我们输入“秦始皇是”四个字,它在大量学习后能预测出下一句话大概率是“中国的第一位皇帝”。
351
+ > pretrain的学习率设置为1e-4到1e-5的动态学习率,预训练epoch数设为5。
352
  ```bash
353
  torchrun --nproc_per_node 2 1-pretrain.py
354
  ```
 
358
  ”后不再无脑接龙,而是意识��这是一段完整的对话结束。
359
  - 我们称这个过程为指令微调,就如同让学富五车的「牛顿」先生适应21世纪的聊天习惯,学习屏幕左侧是对方消息,右侧是本人消息这个规律。
360
  - 在训练时,MiniMind的指令和回答长度被截断在512,是为了节省显存空间。就像我们学习时,会先从短的文章开始,当学会阅读200字作文后,800字长文章就不需要再单独学习。
361
+ > 在推理时通过调整RoPE线性差值,实现长度外推到1024或2048及以上很方便。学习率设置为1e-5到1e-6的动态学习率,微调epoch数为6
362
 
363
  ```bash
364
  # 3-full_sft.py中设置数据集为sft_data_single.csv
 
370
  - 构建【问题->回答,问题->回答,问题->】的新聊天模板,然后使用这个数据集进行微调。
371
  - 学习完成的模型不仅仅只能回答当前问题,还能根据历史对话进行连贯的对话。
372
  - 这一步并非必须,因为小模型长上文对话能力很弱,强行对齐多轮问答模板会损失一定程度的单轮SFT效果。
373
+ > 学习率设置为1e-5到1e-6的动态学习率,微调epoch数为5
374
  ```bash
375
  # 3-full_sft.py中设置数据集为sft_data.csv
376
  torchrun --nproc_per_node 2 3-full_sft.py
 
382
  ```bash
383
  python 5-dpo_train.py
384
  ```
 
 
 
 
 
 
 
 
 
 
 
 
385
  ---
386
 
387
+ 📋关于LLM的参数配置,有一篇很有意思的论文[MobileLLM](https://arxiv.org/pdf/2402.14905)做了详细的研究和实验。
388
  scaling law在小模型中有自己独特的规律。
389
  引起Transformer参数成规模变化的参数几乎只取决于`d_model`和`n_layers`。
390
 
 
397
  例如当模型参数固定在125M或者350M时,30~42层的「狭长」模型明显比12层左右的「矮胖」模型有更优越的性能,
398
  在常识推理、问答、阅读理解等8个基准测试上都有类似的趋势。
399
  这其实是非常有趣的发现,因为以往为100M左右量级的小模型设计架构时,几乎没人尝试过叠加超过12层。
 
400
  这与MiniMind在训练过程中,模型参数量在`d_model`和`n_layers`之间进行调整实验观察到的效果是一致的。
401
  然而「深而窄」的「窄」也是有维度极限的,当d_model<512时,词嵌入维度坍塌的劣势非常明显,
402
  增加的layers并不能弥补词嵌入在固定q_head带来d_head不足的劣势。
403
  当d_model>1536时,layers的增加似乎比d_model的优先级更���,更能带来具有“性价比”的参数->效果增益。
404
+ 因此MiniMind设定small模型的d_model=512,n_layers=8来获取的「极小体积<->更好效果」的平衡。
405
+ 设定d_model=768,n_layers=16来获取效果的更大收益,更加符合小模型scaling-law的变化曲线。
406
 
 
407
 
408
+ > 作为参考,GPT3的参数设定见下表:
409
+
410
+ ![gpt3_config.png](./images/gpt3_config.png)
411
+
412
+ ---
413
+ ### 训练完成的模型权重
414
+
415
+ | Model Name | params | Config | pretrain_model | single_sft_model | multi_sft_model |
416
+ |-------------------|--------|-----------------------------|----------------|----------------------------------------------------------------|----------------------------------------------------------------|
417
+ | minimind-v1-small | 26M | d_model=512<br/>n_layers=8 | - | [链接](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [链接](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) |
418
+ | minimind-v1-moe | 4×26M | d_model=512<br/>n_layers=8 | - | - | - |
419
+ | minimind-v1 | 108M | d_model=768<br/>n_layers=16 | - | [链接](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666) | [链接](https://pan.baidu.com/s/12iHGpAs6R0kqsOnGtgK6vQ?pwd=6666) |
420
+
421
+ ---
422
+
423
 
424
+ # 📌 Eval
 
 
 
 
 
425
 
426
+ > [!TIP]
427
+ > 以下测试于2024-09-17完成,此日期后发布的新模型,无特殊需要时将不加入测试。
 
 
 
 
 
 
428
 
429
+ [A] [minimind-v1-small(0.02B)](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666)<br/>
430
+ [B] minimind-v1-moe(0.1B)<br/>
431
+ [C] [minimind-v1(0.1B)](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666)<br/>
432
+ [D] [baby-llama2-chinese(0.2B)](https://github.com/DLLXW/baby-llama2-chinese)<br/>
433
+ [E] [chatlm-mini-chinese(0.2B)](https://github.com/charent/ChatLM-mini-Chinese)<br/>
434
 
435
+ ```text
436
  [Q]: 你知道长江吗?
437
  [A]: 是的,长江是中国最长的河流,全长超过3300千米。
438
  [B]: 是的,长江是中国最长的河流,全长约7100公里。它的流域面积约为3328万平方公里,是中国最大的河流之一,也是世界第三大河流,总长度约为6300千米。
439
+ [C]: 是的,长江是中国最长的河流之一,也是世界第三大河流。长江发源于青藏高原,流经11个省份,最终注入东海。长江流域是中国最大的河流流域之一,拥有着丰富的自然资源和文化遗产,如长江大桥、长江海洋博物馆、长江古城等。
440
+ [D]: 同时也知道,中国是世界上最长的城市之一,拥有超过3400万人口。
441
+ [E]: 是的,长江是中国最长的河流,也是世界上最长的河流之一。它发源于青藏高原,流经九个省区,最终注入东海。长江流域的气候温和,四季分明,是中国重要的农业和工业基地。长江流域的水产资源丰富,是中国的重要渔业资源之一。
 
442
 
443
 
444
  [Q]: 第一颗人造卫星是哪个国家发射的?
445
  [A]: 第一颗人造卫星是斯普特尼克1号。
446
  [B]: 第一颗人造卫星是美国发射的。
447
+ [C]: 第一颗人造卫星是苏联发射的,它是由苏联于1957年10月发射的。
448
+ [D]: 这颗名为“阿姆斯特朗”的卫星,于1961年11月20日发射,是第一个成功将人类送入太空的卫星。
449
+ [E]: 第一颗人造卫星是苏联于195744日发射的,它是由苏联宇航员尤里·加加林(Yuri Gagarin)和苏联工程师罗尔德·谢尔盖耶夫(Roald Skohley)共同发射的。在发射过程中,加加林和谢尔盖耶夫使用了一个名为“斯普特尼克1号”的小型卫星,它的直径为4.3米,重量为1,051公斤。这枚卫星被称为“斯普特尼克1号”,是苏联太空竞赛中的一个里程碑,也是人类历史上第一次在太空中进行太空行走。
 
450
 
451
 
452
  [Q]: 你知道泰山在哪里吗?
453
  [A]: 泰山位于中国东部地区,是中国五岳之一。
454
  [B]: 泰山位于山东省泰安市,是中国五岳之一。
455
+ [C]: 泰山位于中国山东省泰安市,是中国五岳之一。
456
+ [D]: 作为一个人工智能助手,我无法访问互联网,因此无法提供关于你所在位置的信息。不过,你可以通过搜索或搜索相关关键词来获取有关你所在位置的信息,或者查看相关新闻或网站来获取有关你所在位置的信息。
457
+ [E]: 泰山位于中国山东省泰安市境内,是中国五岳之一,也是中国文化和历史的重要象征之一。它是中国五岳之一,也是中国道教和佛教的发源地之一。泰山的海拔高度为1,545米,是中国五岳之一,也是中国最著名的旅游景点之一。它的历史可以追溯到公元前5世纪,是中国古代帝王祭祀的场所。泰山也是中国文化的重要组成部分,被誉为“五岳之首”,是中国文化的重要遗产之一。
 
458
 
459
 
460
  [Q]: 世界上最高的山峰是什么?
461
  [A]: 世界上最高的山峰是珠穆朗玛峰,位于尼泊尔和中国的边境上,海拔8848米。
462
  [B]: 珠穆朗玛峰是世界上最高的山峰,位于喜马拉雅山脉,海拔8848米。
463
+ [C]: 珠穆朗玛峰是世界上最高的山峰,海拔8848米。
464
+ [D]: 世界上最高的山峰是珠穆朗玛峰,位于喜马拉雅山脉,海拔8,848.86米(29,031.69英尺)。
465
+ [E]: 世界上最高的山峰是珠穆朗玛峰,也称为萨加玛塔,位于喜马拉雅山脉的尼泊尔和中国边境之间,海拔8,848.86米(29,031.69英尺)。它是世界上最著名的登山目标之一,吸引着来自世界各地的登山者前来挑战。珠穆朗玛峰的名字来自藏语,意思是“女神的额峰”,它是由两个不同的山峰组成的,其中一个是位于尼泊尔的珠穆朗玛峰,另一个则位于中国。它的高度是通过测量海拔高度来确定的,因为它是通过测量地球的平均高度来确定的。尽管珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何其他地方都能到达的。地球上的其他地方,如南美洲的安第斯山脉,也是世界上最高的山脉,但它并不是地球上最高的山峰。总之,珠穆朗玛峰是世界上最高的山峰,但它并不是地球上任何地方都能到达的。它是地球上最高的山脉,但它并不是地球上任何地方都能到达的。
 
466
 
467
 
468
  [Q]: 万有引力是谁提出的?
469
  [A]: 万有引力是由英国物理学家牛顿提出的。他认为,所有物体都会产生引力,这个引力是由物体的质量和距离决定的。这个理论揭示了质量和距离之间的关系,并证明了万有引力的本质。
470
  [B]: 万有引力是由牛顿发现的,它是由英国物理学家詹姆斯·瓦特于1879年提出的。
471
+ [C]: 万有引力是英国物理学家艾萨克·牛顿在17世纪提出的。
472
+ [D]: 引力是宇宙中一种力量,它是由质量或能量所组成的。根据牛顿的万有引力定律,一个物体之间的引力大小与它的质量成正比,与它的质量成反比。因此,牛顿的万有引力定律表明,一个物体之间的引力大小与它的质量成正比,与它的质量成反比。这个定律在物理学、天文学和物理学中都有广泛的应用。
473
+ [E]: 万有引力是由英国物理学家艾萨克·牛顿于1687年提出的。他在1687年发表了一篇著名的论文,题为《自然哲学的数学原理》,其中提出了万有引力定律,指出任何两个物体之间都存在着一种引力,这种引力的大小与它们的质量成正比,与它们之间的距离的平方成反比。这个定律被称为万有引力定律,它描述了质量和��离之间的关系。它表明,两个物体之间的引力与它们的质量成正比,与它们之间的距离的平方成反比。这个定律对于解释宇宙中的许多现象和现象之间的相互作用非常重要,例如行星绕太阳运动、恒星之间的引力透镜效应等等。它也被广泛应用于天文学、物理学和工程学等领域,为我们理解宇宙的本质和演化提供了重要的理论基础。
 
474
 
475
 
476
  [Q]: 你知道大熊猫的主要食物是什么吗?
477
  [A]: 大熊猫的主要食物是竹子。它们通常吃竹子,每天需要吃掉大约30公斤的竹子。
478
  [B]: 大熊猫主要是竹子,它们的主要食物是竹子。大熊猫的消化系统比较特殊,它们每天需要消化大量的竹子才能维持其生存需要。除了竹子外,大熊猫还在野外生存。
479
+ [C]: 大熊猫主要以竹子为食,也会偶尔食用其他的植物和动物。此外,大熊猫也会吃一些野果、小动物和植物,但这些都不属于它们的主要食物来源。
480
+ [D]: 大熊猫的主要食物是竹子。
481
+ [E]: 大熊猫是一种以竹子为主要食物的植食动物,它们的肠道结构适合消化纤维素含量高的竹子,但也会吃一些其他植物,如草本植物和小型哺乳动物。大熊猫的肠道结构也适应了高纤维的竹子,这使得它们能够消化大量的纤维素,从而帮助它们在野外生存。大熊猫的肠道中还含有一种叫做“纤维素酶”的酶类物质,这种酶能够帮助大熊猫分解纤维素,从而帮助它们消化大量的水分和营养物质。此外,大熊猫的肠道还含有一种叫做“纤维素蛋白”的蛋白质,这种蛋白质能够帮助大熊猫维持体内的能量水平,从而帮助它们在野外生存。总的来说,大熊猫的主要食物是竹子,但它们也会吃一些其他植物,如草本植物和小型哺乳动物。
482
+
483
 
484
  [Q]: 海水为什么是咸的?
485
  [A]: 海水中含有许多盐分,这些盐分可以导致海水咸度的增加。海水中的盐分主要来自于海水中的盐分,这些盐分可以从海水中的盐分中获得。海水中的氧化物和盐分可以与海水中的盐分形成氢键,从而使海水咸度增加。
486
  [B]: 海水中含有丰富的钙、磷、钾和磷等元素,这些元素在海洋中的数量和分布非常有限。此外,海水中还含有各种生物和微生物的营养物质,如贝壳和海藻等。
487
+ [C]: 海水含有许多溶解的盐类,其中最主要的是氯化钠。当水中加入盐时,盐的分子会与水分子发生反应,形成氢氧化钠,这是一种盐。
488
+ [D]: 因为水是许多生物的生存所必需的,它能够吸收和释放二氧化碳,释放氧气,维持大气中的水蒸气,以及维持水循环。水中的盐类和化学物质会释放出氧气,使地球上的生物呼吸。此外,水还具有许多其他功能,如保持水循环、维持水循环、调节气候和环境等。水是许多生物生存所必需的,它也是许多自然现象的重要基础。
489
+ [E]: 海水是咸的主要原因是因为其中含有大量的溶解盐类,其中最主要的是氯化钠(NaCl)。氯化钠是一种非常常见的盐类,它存在于地球上的大部分地区,包括海洋、湖泊、河流和地下水。当水从海洋中蒸发时,它会带走一部分盐分,使海水变得咸。此外,海中还含有其他溶解盐类的物质,如硫酸盐、硫酸镁、硫酸钙、硫酸镁盐、硫酸镁钾、硫酸镁钠、硫酸镁钙、硫酸镁镁钙、硫酸镁镁钙、硫酸镁镁钙、硫酸镁镁钠、硫酸镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁钙、硫酸镁镁镁镁镁镁镁钙、硫酸镁镁镁镁
 
490
  ```
491
 
492
+ > [!NOTE]
493
+ > 🙋‍♂️直接把上述模型的回答丢给GPT-4o,让它帮忙打个分:
494
 
495
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
496
 
497
+ ### 模型表现点评:
 
 
 
498
 
499
+ 1. **模型A**:
500
+ - **表现**:模型A的回答通常简洁明了,但在某些问题上缺乏详细信息和准确性。例如,在长江的长度问题上,模型A的回答是错误的。
501
+ - **评分**:60
502
 
503
+ 2. **模型B**:
504
+ - **表现**:模型B的回答在某些问题上提供了额外的信息,但这些信息有时是不准确的或多余的。例如,在长江的长度问题上,模型B提供了不准确的长度和流域面积。
505
+ - **评分**:65
 
506
 
507
+ 3. **模型C**:
508
+ - **表现**:模型C的回答通常较为详细,且在大多数问题上提供了准确的信息。例如,在长江和泰山的问题上,模型C的回答是准确的。
509
+ - **评分**:75
510
 
511
+ 4. **模型D**:
512
+ - **表现**:模型D的回答在某些问题上显得混乱,且缺乏准确性。例如,在泰山的问题上,模型D的回答完全偏离了主题。
513
+ - **评分**:50
 
514
 
515
+ 5. **模型E**:
516
+ - **表现**:模型E的回答通常非常详细,但在某些问题上过于冗长,且包含了一些不必要的信息。例如,在万有引力的问题上,模型E的回答过于复杂。
517
+ - **评分**:70
518
 
519
+ #### 排序(从高到低):
 
 
 
 
 
520
 
521
+ | 模型 | C | E | B | A | D |
522
+ |----|----|----|----|----|----|
523
+ | 分数 | 75 | 70 | 65 | 60 | 50 |
524
 
525
  ---
526
 
527
  ## 👉效果总结
528
 
529
+ * minimind系列(ABC)的排序符合直觉,minimind-v1(0.1B)评分最高,常识性问题的回答基本没有错误和幻觉。
530
+ * 出乎意料的是,minimind-v1-small(0.02B)仅有26M参数,却可以接近minimind-v1(0.1B)的表现。
531
+ * minimind-v1(0.1B)的sft轮数`epochs`仅有不到2,偷懒提前kill腾出资源给小模型,0.1B没有得到充分训练的情况下依然做到了最强,其实还是底大一级压死人。
532
+ * minimind-v1-moe(0.1B)
533
+ 表现很差,同样是因为偷懒提前kill腾出资源给小模型,但是MoE模型多专家模式需要的训练轮次本来就需要酌情更高,在epochs设置为2时训练的极其不充分。minimind不久前实验阶段在Yi
534
+ tokenizer上试验过moe的充分训练版本,可以做到比dense表现肉眼可见的更好。日后腾出服务器再训练更新v2v3版本。
 
535
 
 
536
 
537
+ * E模型的回答看起来是这里最完美的,尽管存在些许幻觉瞎编的情况。但GPT-4o和Deepseek的评分都一致认为它“信息过度冗长,且有重复内容,存在幻觉”。
538
+ 其实这种评价太严格了,100个字中有10个字是幻觉,就很容易把它归到0分。由于F模型训练文本默认长度更长,数据集大得多,所以回答的看起来很完备,在体积近似的情况下,数据比模型更重要得多。
539
 
540
+ > 🙋‍♂️个人主观评价:E>C>B≈A>D
541
 
542
+ > 🤖 GPT-4o 评价:C>E>B>A>D
543
+
544
+ Scaling Law:模型参数越大,训练数据越多模型的性能越强。
545
 
546
  # 📌 Objective dataset: C-Eval
547
 
 
552
 
553
  > 例如minimind-small的结果细项:
554
 
555
+ | Type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
556
+ |------|----------------------------|-----|-----------------------|-----------------------|---------------------|--------------------|---------------------|---------------------|----------------|------------------------|-----------------------|-----------------------|----------------|------------------|-------|---------------------|---------------|---------------------------------|---------------------|------------|------------------|-------------------------|--------------------|---------------------|---------|----------------------|-------------------------|-------------------------|--------------------|-----------------------------------|-------------------|-------------------------|------------------------------------------|-----------------------|-------------------------|-----------------|---------------------------|----------------------|-----------|-------------------|---------------------|-----------------------|------------------------|-------------------|------------------|----------------|-------------|-----------------------|----------------------|-------------------|---------------|-------------------------|
557
+ | Data | probability_and_statistics | law | middle_school_biology | high_school_chemistry | high_school_physics | legal_professional | high_school_chinese | high_school_history | tax_accountant | modern_chinese_history | middle_school_physics | middle_school_history | basic_medicine | operating_system | logic | electrical_engineer | civil_servant | chinese_language_and_literature | college_programming | accountant | plant_protection | middle_school_chemistry | metrology_engineer | veterinary_medicine | marxism | advanced_mathematics | high_school_mathematics | business_administration | mao_zedong_thought | ideological_and_moral_cultivation | college_economics | professional_tour_guide | environmental_impact_assessment_engineer | computer_architecture | urban_and_rural_planner | college_physics | middle_school_mathematics | high_school_politics | physician | college_chemistry | high_school_biology | high_school_geography | middle_school_politics | clinical_medicine | computer_network | sports_science | art_studies | teacher_qualification | discrete_mathematics | education_science | fire_engineer | middle_school_geography |
558
+
559
+ | Type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
560
+ |----------|--------|--------|--------|--------|--------|-------|--------|--------|--------|--------|--------|--------|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------|
561
+ | T/A | 3/18 | 5/24 | 4/21 | 7/19 | 5/19 | 2/23 | 4/19 | 6/20 | 10/49 | 4/23 | 4/19 | 4/22 | 1/19 | 3/19 | 4/22 | 7/37 | 11/47 | 5/23 | 10/37 | 9/49 | 7/22 | 4/20 | 3/24 | 6/23 | 5/19 | 5/19 | 4/18 | 8/33 | 8/24 | 5/19 | 17/55 | 10/29 | 7/31 | 6/21 | 11/46 | 5/19 | 3/19 | 4/19 | 13/49 | 3/24 | 5/19 | 4/19 | 6/21 | 6/22 | 2/19 | 2/19 | 14/33 | 12/44 | 6/16 | 7/29 | 9/31 | 1/12 |
562
+ | Accuracy | 16.67% | 20.83% | 19.05% | 36.84% | 26.32% | 8.70% | 21.05% | 30.00% | 20.41% | 17.39% | 21.05% | 18.18% | 5.26% | 15.79% | 18.18% | 18.92% | 23.40% | 21.74% | 27.03% | 18.37% | 31.82% | 20.00% | 12.50% | 26.09% | 26.32% | 26.32% | 22.22% | 24.24% | 33.33% | 26.32% | 30.91% | 34.48% | 22.58% | 28.57% | 23.91% | 26.32% | 15.79% | 21.05% | 26.53% | 12.50% | 26.32% | 21.05% | 28.57% | 27.27% | 10.53% | 10.53% | 42.42% | 27.27% | 37.50% | 24.14% | 29.03% | 8.33% |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
563
 
564
  ```text
565
  总题数: 1346
 
571
 
572
  #### 结果汇总:
573
 
574
+ | category | correct | question_count | accuracy |
575
+ |:------------------|:--------:|:--------------:|:--------:|
576
+ | minimind-v1-small | 344 | 1346 | 25.56% |
577
+ | minimind-v1 | 351 | 1346 | 26.08% |
 
 
578
 
579
  #### 以下来自GPT-4o对minimind表现的瞎猜:
580
 
 
605
  ### 推理与导出
606
 
607
  * [./export_model.py](./export_model.py)可以导出模型到transformers格式,推送到huggingface
 
608
 
609
+ * MiniMind的huggingface集合地址:
610
+ [MiniMind](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
611
 
612
  ---
613
 
614
  ### API推理
615
 
616
+ * [my_openai_api.py](./my_openai_api.py)完成了openai_api的聊天接口,方便将自己的模型接入第三方UI
617
+ 例如fastgpt、OpenWebUI等
618
 
619
  * 从[Huggingface](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)下载模型权重文件
620
  ```
 
661
 
662
  # 📌 Acknowledge
663
 
664
+ > [!NOTE]
665
+ > 如果您觉得 `MiniMind`对您有所帮助,请在 GitHub 上给一个⭐<br/>
666
+ > 您的支持是我们持续改进项目的动力!篇幅不短水平有限难免纰漏,欢迎在issue交流和指正。
667
 
668
+ ## 🤝[贡献者](https://github.com/jingyaogong/minimind/graphs/contributors)
669
 
670
+ <!--
671
+ <a href="https://github.com/jingyaogong/minimind/graphs/contributors">
672
+ <img src="https://contrib.rocks/image?repo=jingyaogong/minimind&v3" />
673
+ </a>
674
+ -->
675
 
676
+ <a href="https://github.com/jingyaogong"><img src="https://avatars.githubusercontent.com/u/62287848" width="70px" height="70px"/></a>
677
+ &nbsp;
678
+ <a href="https://github.com/MuWinds"><img src="https://avatars.githubusercontent.com/u/93832089" width="70px" height="70px"/></a>
679
+ &nbsp;
680
+ <a href="https://github.com/chuanzhubin"><img src="https://avatars.githubusercontent.com/u/2813798" width="70px" height="70px"/></a>
681
+ &nbsp;
682
 
683
+ ## 😊鸣谢
684
 
685
+ <a href="https://github.com/ipfgao"><b>@ipfgao</b></a>:
686
+ <a href="https://github.com/jingyaogong/minimind/issues/26">🔗训练步骤记录</a>
687
+
688
+ ## 🫶支持者
689
+
690
+ <a href="https://github.com/jingyaogong/minimind/stargazers">
691
+ <picture>
692
+ <source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/stars/dark/jingyaogong/minimind"/>
693
+ <source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/stars/jingyaogong/minimind"/>
694
+ <img alt="github contribution grid snake animation" src="https://reporoster.com/stars/jingyaogong/minimind"/>
695
+ </picture>
696
  </a>
697
 
698
+ <a href="https://github.com/jingyaogong/minimind/network/members">
699
+ <picture>
700
+ <source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/forks/dark/jingyaogong/minimind"/>
701
+ <source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/forks/jingyaogong/minimind"/>
702
+ <img alt="github contribution grid snake animation" src="https://reporoster.com/forks/jingyaogong/minimind"/>
703
+ </picture>
704
+ </a>
705
+
706
+ <picture>
707
+ <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date&theme=dark"/>
708
+ <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date"/>
709
+ <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date"/>
710
+ </picture>
711
+
712
+ # License
713
 
714
+ This repository is licensed under the [Apache-2.0 License](LICENSE).
715
 
 
716
 
 
README_en.md CHANGED
@@ -1,4 +1,9 @@
 
 
1
  ![logo](./images/logo.png)
 
 
 
2
  <div align="center">
3
 
4
  ![visitors](https://visitor-badge.laobi.icu/badge?page_id=jingyaogong/minimind)
@@ -21,7 +26,6 @@
21
 
22
  </div>
23
 
24
-
25
  * This open-source project aims to train a miniature language model **MiniMind** from scratch, with a size of just 26MB.
26
  * **MiniMind** is extremely lightweight, approximately $\frac{1}{7000}$ the size of GPT-3, designed to enable fast
27
  inference and even training on CPUs.
@@ -32,6 +36,14 @@
32
 
33
  ---
34
 
 
 
 
 
 
 
 
 
35
  # 📌 Introduction
36
 
37
  In the field of large language models (LLMs) such as GPT, LLaMA, GLM, etc., while their performance is impressive, the
@@ -46,24 +58,24 @@ exacerbates the problem of finding quality content to understand LLMs, severely
46
  Therefore, the goal of this project is to lower the barrier to entry for working with LLMs as much as possible, by
47
  training an extremely lightweight language model from scratch.
48
 
49
- (As of August 28, 2024) The initial release of MiniMind includes four model variants, with the smallest being just
50
- 26MB (0.02B) and still exhibiting amazing conversational capabilities!
51
 
52
- | Model (Size) | Speed (Tokens/s) | Inference Memory | Training Memory (`batch_size=8`) |
53
- |------------------------|------------------|------------------|----------------------------------|
54
- | MiniMind-small-T (26M) | 91.9 | 0.5 GB | 3.6 GB |
55
- | MiniMind-small (56M) | 85.2 | 0.7 GB | 4.5 GB |
56
- | MiniMind (218M) | 57.6 | 2.1 GB | 10.4 GB |
57
- | MiniMind-MoE (166M) | 64.9 | 1.6 GB | 7.4 GB |
58
 
59
- > This analysis was conducted on an RTX 3090 GPU with Torch 2.1.2, CUDA 12.2, and Flash Attention 2.
60
 
61
  The project includes:
62
 
63
  - Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA
64
  fine-tuning, and DPO preference optimization, along with datasets and sources.
65
  - Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
66
- - Training support for single-GPU and multi-GPU setups. The training process allows for stopping and resuming at any
 
67
  point.
68
  - Code for testing the model on the Ceval dataset.
69
  - Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat
@@ -71,11 +83,31 @@ The project includes:
71
 
72
  We hope this open-source project helps LLM beginners get started quickly!
73
 
74
- 👉**Recent Updates**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
75
 
76
  <details close>
77
- <summary> <b>2024-08-28</b> </summary>
78
- - Project first open-sourced
 
 
79
  </details>
80
 
81
  # 📌 Environment
@@ -88,18 +120,23 @@ These are my personal software and hardware environment configurations. Please a
88
  * CUDA == 12.2
89
  * [requirements.txt](./requirements.txt)
90
 
91
- # 📌 Start Inference
92
 
93
  <div align="center" style="font-size: 1.5em; font-weight: bold;">
94
  <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
95
  Hugging Face
96
 
97
- [MiniMind-Collection](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
 
 
 
 
 
98
  </div>
99
 
100
  ```bash
101
  # step 1
102
- git clone https://huggingface.co/jingyaogong/minimind
103
  ```
104
 
105
  ```bash
@@ -107,8 +144,33 @@ git clone https://huggingface.co/jingyaogong/minimind
107
  python 2-eval.py
108
  ```
109
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
110
  # 📌 Quick Start
111
 
 
 
 
 
 
 
112
  *
113
  1. Clone the project code
114
 
@@ -119,7 +181,7 @@ git clone https://github.com/jingyaogong/minimind.git
119
  *
120
  2. If you need to train the model yourself
121
 
122
- * 2.1 Download the [dataset download link](#dataset-download-link) and place it in the `./dataset` directory.
123
 
124
  * 2.2 Run `python data_process.py` to process the dataset, such as token-encoding pretrain data and extracting QA
125
  data to CSV files for the SFT dataset.
@@ -133,27 +195,21 @@ git clone https://github.com/jingyaogong/minimind.git
133
  *
134
  3. Test model inference performance
135
 
136
- * Download the weights from the [trained model weights](#trained-model-weights) section below and place them in
137
- the `./out/` directory
138
  ```text
139
  out
140
  ├── multi_chat
141
- │   ├── full_sft_1024.pth
142
  │   ├── full_sft_512.pth
143
- │   ├── full_sft_640_moe.pth
144
- │   └── full_sft_640.pth
145
  ├── single_chat
146
- │   ├── full_sft_1024.pth
147
  │   ├── full_sft_512.pth
148
- │   ├── full_sft_640_moe.pth
149
- │   └── full_sft_640.pth
150
- ├── full_sft_1024.pth
151
- ├── full_sft_512.pth
152
- ├── full_sft_640_moe.pth
153
- ├── full_sft_640.pth
154
- ├── pretrain_1024.pth
155
- ├── pretrain_640_moe.pth
156
- ├── pretrain_640.pth
157
  ```
158
 
159
  * Test the pretraining model's chain effect with `python 0-eval_pretrain.py`
@@ -162,15 +218,18 @@ git clone https://github.com/jingyaogong/minimind.git
162
 
163
  🍭 **Tip**: Pretraining and full parameter fine-tuning (`pretrain` and `full_sft`) support DDP multi-GPU acceleration.
164
 
165
- * Start training on a single machine with N GPUs
166
-
167
- ```text
168
  torchrun --nproc_per_node N 1-pretrain.py
169
- ```
170
-
171
- ```text
172
  torchrun --nproc_per_node N 3-full_sft.py
173
  ```
 
 
 
 
 
 
174
 
175
  # 📌 Data sources
176
 
@@ -191,28 +250,26 @@ git clone https://github.com/jingyaogong/minimind.git
191
 
192
  Powerful open-source models like 01万物, 千问, chatglm, mistral, and Llama3 have the following tokenizer vocabulary
193
  sizes:
194
-
195
- | Tokenizer Model | Vocabulary Size | Source |
196
- |----------------------|------------------|-----------------------|
197
- | yi tokenizer | 64,000 | 01-AI (China) |
198
- | qwen2 tokenizer | 151,643 | Alibaba Cloud (China) |
199
- | glm tokenizer | 151,329 | Zhipu AI (China) |
200
- | mistral tokenizer | 32,000 | Mistral AI (France) |
201
- | llama3 tokenizer | 128,000 | Meta (USA) |
202
- | minimind tokenizer | 6,400 | Custom |
203
-
204
- > Although Mistral’s Chinese vocabulary proportion is small and its encoding/decoding efficiency is weaker than
205
- Chinese-friendly tokenizers like qwen2 and glm, MiniMind chose the Mistral tokenizer to keep the overall model
206
- lightweight and avoid being top-heavy, as Mistral’s vocabulary size is only 32,000. MiniMind has shown excellent
207
- performance in practical tests, with almost no failures in decoding rare words.
208
-
209
- > For comparison purposes, an additional custom Tokenizer version **MiniMind(-T)** was trained, reducing the
210
- vocabulary size to 6,400, which further decreases the total model parameters to around 26M.
211
 
212
  ---
213
 
214
  - 📙 **[Pretrain Data](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)**:
215
- The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
216
  is a collection of data from various public sources such as websites, encyclopedias, blogs, open-source code, books,
217
  etc. It has been compiled, cleaned, and organized into a unified JSONL format, with rigorous filtering and
218
  deduplication to ensure data comprehensiveness, scale, reliability, and high quality. The total amount is
@@ -252,12 +309,13 @@ git clone https://github.com/jingyaogong/minimind.git
252
 
253
  ### Dataset Download Links
254
 
255
- | MiniMind Training Dataset | Download Link |
256
- |---------------------------|------------------------------------------------------------------------------------------------------------------------------------|
257
- | **[Pretrain Data]** | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) |
258
- | **[SFT Data]** | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
259
- | **[DPO Data]** | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
260
- | **[DPO Data]** | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |
 
261
 
262
  # 📌 Model
263
 
@@ -288,15 +346,12 @@ and FFN layer code. The structure is illustrated in the figure below (redrawn):
288
  Model configurations can be found in [./model/LMConfig.py](./model/LMConfig.py). The model types and parameters are
289
  shown in the table below:
290
 
291
- | Model Name | Params | Vocabulary Size | Layers | Model Dimension | KV Heads | Query Heads | Share+Route | TopK |
292
- |------------------|--------|-----------------|--------|-----------------|----------|-------------|-------------|------|
293
- | minimind-small-T | 26M | 6400 | 8 | 512 | 8 | 16 | - | - |
294
- | minimind-small | 56M | 32000 | 8 | 640 | 8 | 16 | - | - |
295
- | minimind | 218M | 32000 | 16 | 1024 | 8 | 16 | - | - |
296
- | minimind-MoE | 166M | 32000 | 8 | 640 | 8 | 16 | 2+4 | 2 |
297
 
298
- For reference, the configuration details for GPT-3 are shown in the table below:
299
- ![gpt3_config.png](./images/gpt3_config.png)
300
 
301
  # 📌 Experiment
302
 
@@ -307,12 +362,11 @@ GPU: NVIDIA GeForce RTX 3090 (24GB) * 2
307
  Environment: python 3.9 + Torch 2.1.2 + DDP multi-GPU training
308
  ```
309
 
310
- | Model Name | params | len_vocab | batch_size | pretrain_time | sft_single_time | sft_multi_time |
311
- |------------------|--------|-----------|------------|--------------------|-------------------|---------------------|
312
- | minimind-small-T | 26M | 6400 | 64 | ≈5 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
313
- | minimind-small | 56M | 32000 | 24 | ≈6 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
314
- | minimind | 218M | 32000 | 16 | ≈15 hour (1 epoch) | ≈5 hour (1 epoch) | ≈1 hour (1 epoch) |
315
- | minimind-MoE | 166M | 32000 | 16 | ≈13 hour (1 epoch) | ≈5 hour (1 epoch) | ≈1 hour (1 epoch) |
316
 
317
  ---
318
 
@@ -374,221 +428,146 @@ Environment: python 3.9 + Torch 2.1.2 + DDP multi-GPU training
374
  ```bash
375
  python 5-dpo_train.py
376
  ```
377
-
378
- ---
379
- 🔗 **Trained Model Weights**:
380
-
381
- | Model Name | params | Config | pretrain_model | single_sft_model | multi_sft_model |
382
- |------------------|--------|-------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|
383
- | minimind-small-T | 26M | d_model=512<br/>n_layers=8 | - | [链接](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [链接](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) |
384
- | minimind-small | 56M | d_model=640<br/>n_layers=8 | [链接](https://pan.baidu.com/s/1nJuOpnu5115FDuz6Ewbeqg?pwd=6666) | [链接](https://pan.baidu.com/s/1lRX0IcpjNFSySioeCfifRQ?pwd=6666) | [链接](https://pan.baidu.com/s/1LzVxBpL0phtGUH267Undqw?pwd=6666) |
385
- | minimind | 218M | d_model=1024<br/>n_layers=16 | [链接](https://pan.baidu.com/s/1jzA7uLEi-Jen2fW5olCmEg?pwd=6666) | [链接](https://pan.baidu.com/s/1Hvt0Q_UB_uW2sWTw6w1zRQ?pwd=6666) | [链接](https://pan.baidu.com/s/1fau9eat3lXilnrG3XNhG5Q?pwd=6666) |
386
- | minimind-MoE | 166M | d_model=1024<br/>n_layers=8<br/>share+route=2+4 | [链接](https://pan.baidu.com/s/11CneDVTkw2Y6lNilQX5bWw?pwd=6666) | [链接](https://pan.baidu.com/s/1fRq4MHZec3z-oLK6sCzj_A?pwd=6666) | [链接](https://pan.baidu.com/s/1HC2KSM_-RHRtgv7ZDkKI9Q?pwd=6666) |
387
-
388
  ---
 
 
389
 
390
- Regarding the parameter configuration of LLMs, an interesting paper [MobileLLM](https://arxiv.org/pdf/2402.14905) has
391
- conducted detailed research and experiments.
392
-
393
- The scaling laws exhibit unique patterns in small models.
394
-
395
- The parameters that cause the scaling of Transformer models almost solely depend on `d_model` and `n_layers`.
396
 
397
- * `d_model`↑ + `n_layers`↓ -> Short and Fat
398
- * `d_model`↓ + `n_layers`↑ -> Tall and Thin
 
 
 
399
 
400
- The paper proposing the Scaling Law in 2020 suggested that the amount of training data, the number of parameters, and
401
- the number of training iterations are the key factors determining performance, while the impact of model architecture
402
- can be nearly ignored. However, this law does not seem to fully apply to small models.
403
 
404
- MobileLLM proposes that the depth of the architecture is more important than its width. A "deep and narrow" model can
405
- learn more abstract concepts compared to a "wide and shallow" model. For example, when the model parameters are fixed at
406
- 125M or 350M, a "narrow" model with 30–42 layers performs significantly better than a "short and fat" model with around
407
- 12 layers. This trend is observed across eight benchmark tests, including common-sense reasoning, question answering,
408
- and reading comprehension.
409
 
410
- This is a very interesting finding, as previously, almost no one attempted to stack more than 12 layers when designing
411
- architectures for models around the 100M parameter scale.
412
 
413
- This observation aligns with the results of MiniMind, where experiments adjusting the model parameter quantities
414
- between `d_model` and `n_layers` during training observed similar effects. However, there is a dimensional limit to "
415
- deep and narrow" models. When `d_model` < 512, the drawbacks of collapsing word embedding dimensions are quite
416
- pronounced. Increasing the number of layers cannot compensate for the deficiencies in `d_head` caused by fixed `q_head`.
417
 
418
- When `d_model` > 1536, increasing the number of layers seems to take precedence over `d_model`, providing a more "
419
- cost-effective" parameter-to-performance gain. Therefore, MiniMind sets the `d_model` of the small model to 640
420
- and `n_layers` to 8 to achieve a balance of "small size -> better performance". Setting `d_model` to 1024 and `n_layers`
421
- to 16 provides greater performance gains, aligning better with the scaling-law curve for small models.
 
422
 
423
  ---
424
 
 
425
  # 📌 Eval
426
 
427
- [A] [minimind-small-T(0.02B)](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666)<br/>
428
- [B] [minimind-small(0.05B)](https://pan.baidu.com/s/1lRX0IcpjNFSySioeCfifRQ?pwd=6666)<br/>
429
- [C] [minimind-MoE(0.16B)](https://pan.baidu.com/s/1fRq4MHZec3z-oLK6sCzj_A?pwd=6666)<br/>
430
- [D] [minimind(0.2B)](https://pan.baidu.com/s/1Hvt0Q_UB_uW2sWTw6w1zRQ?pwd=6666)<br/>
431
- [E] [baby-llama2-chinese(0.2B)](https://github.com/DLLXW/baby-llama2-chinese)<br/>
432
- [F] [chatlm-mini-chinese(0.2B)](https://github.com/charent/ChatLM-mini-Chinese)<br/>
433
 
434
- ```
435
- [Q]: How many days are there in a week?
436
- [A]: There are seven days in a week.
437
- [B]: There are 7 days in a week.
438
- [C]: According to your calculation, there are 12 hours in a week. Therefore, we need to multiply the number of hours by 12. This will give us the number of hours in 12.
439
- [D]: There are seven days in a week.
440
- [E]: As an AI assistant, I cannot determine the number of weeks you are referring to. If you can provide the number of weeks you are describing, I can provide information about the number of days in that week.
441
- [F]: There are seven days in a week, which is the product of seven days, i.e., 7 x 7 = 24 days. This number is obtained by dividing 24 by 7, because a week has seven days, so a week has 24 x 7 = 24 days.
442
-
443
- [Q]: Do you know about the Yangtze River?
444
- [A]: Yes, the Yangtze River is the longest river in China, with a length of over 3300 kilometers.
445
- [B]: Yes, the Yangtze River is the longest river in China, with a length of approximately 7100 kilometers. Its basin area is about 33.28 million square kilometers, making it one of the largest rivers in China and the third-longest river in the world, with a total length of about 6300 kilometers.
446
- [C]: I know. The Yangtze River is an important water system in China, with a long history and culture. The Yangtze River is a symbol and spiritual pillar of the Chinese nation, and it is also one of the most beautiful natural landscapes in the world. People living along the Yangtze River, especially those in the Yangtze River basin, have lived there for thousands of years. The Yangtze River has deep historical and cultural significance in China, especially in its ancient culture.
447
- [D]: The Yangtze River is the largest river in China, with a length of about 6380 kilometers. It is the longest river in Asia and the largest river in China. The Yangtze River basin is one of China’s important river basins, rich in natural resources and cultural heritage. The Yangtze River basin is a major agricultural and economic development area in China, with numerous cultural heritage sites and historical attractions, making it one of China's most important cultural tourism destinations.
448
- [E]: Also, China is one of the longest cities in the world, with a population of over 34 million.
449
- [F]: Yes, the Yangtze River is the longest river in China and one of the longest rivers in the world. It originates from the Tibetan Plateau, flows through nine provinces and regions, and finally empties into the East China Sea. The Yangtze River basin has a mild climate with distinct seasons, making it an important agricultural and industrial base in China. The Yangtze River basin is rich in aquatic resources and is one of China’s important fishing resources.
450
 
451
  [Q]: Which country launched the first artificial satellite?
452
  [A]: The first artificial satellite was Sputnik 1.
453
  [B]: The first artificial satellite was launched by the United States.
454
- [C]: The first artificial satellite was launched by the China National Space Administration, which is a satellite launched by the China National Space Administration.
455
- [D]: The first artificial satellite was launched by the Soviet Union in 1957.
456
- [E]: The satellite named "Armstrong" was launched on November 20, 1961, and was the first to successfully send humans into space.
457
- [F]: The first artificial satellite was launched by the Soviet Union on April 4, 1957. It was launched by Soviet astronaut Yuri Gagarin and Soviet engineer Roald Skohley. During the launch, Gagarin and Skohley used a small satellite named "Sputnik 1," which had a diameter of 4.3 meters and weighed 1,051 kilograms. This satellite, called "Sputnik 1," was a milestone in the Soviet space race and marked the first time humans conducted a spacewalk in space.
458
 
459
  [Q]: Do you know where Mount Tai is located?
460
- [A]: Mount Tai is located in the eastern part of China and is one of the Five Great Mountains of China.
461
- [B]: Mount Tai is located in Tai'an, Shandong Province, and is one of the Five Great Mountains of China.
462
- [C]: Where is Mount Tai?
463
- [D]: Mount Tai is located in Kunshan, Shandong Province, China. It is one of the treasures of ancient Chinese culture and one of the world-famous mountain scenic spots. Mount Tai is one of the Five Great Mountains, known for its steep peaks, beautiful natural scenery, and rich cultural heritage.
464
- [E]: As an AI assistant, I do not have internet access and therefore cannot provide information about your location. However, you can search or look up related keywords to obtain information about your location, or check news or websites for information about your location.
465
- [F]: Mount Tai is located in Tai'an, Shandong Province, China, and is one of the Five Great Mountains of China. It is also one of the important symbols of Chinese culture and history. Mount Tai is one of the Five Great Mountains, and it is one of the origins of Taoism and Buddhism in China. The elevation of Mount Tai is 1,545 meters, making it one of the most famous tourist attractions in China. Its history dates back to the 5th century BC and was a place for ancient emperors to offer sacrifices. Mount Tai is also an important part of Chinese culture, known as the "First of the Five Great Mountains," and is one of China's important cultural heritage sites.
466
 
467
  [Q]: What is the highest mountain in the world?
468
- [A]: The highest mountain in the world is Mount Everest, located on the border between Nepal and China, with an elevation of 8848 meters.
469
- [B]: Mount Everest is the highest mountain in the world, located in the Himalayas, with an elevation of 8848 meters.
470
- [C]: The highest mountain in the world is Mount Everest.
471
- [D]: The highest mountain in the world is Mount Everest, located in the Himalayas, with an elevation of 8,848.86 meters.
472
- [E]: The highest mountain in the world is Mount Everest, located in the Himalayas, with an elevation of 8,848.86 meters (29,031.69 feet).
473
- [F]: The highest mountain in the world is Mount Everest, also known as Sagarmatha, located on the border between Nepal and China in the Himalayas, with an elevation of 8,848.86 meters (29,031.69 feet). It is one of the most famous climbing targets in the world, attracting climbers from around the globe. The name Everest comes from the Tibetan language and means "Goddess's forehead peak." It consists of two different peaks, one in Nepal and the other in China. Its height is determined by measuring sea level elevation, as it is calculated based on Earth's average height. Although Mount Everest is the highest mountain in the world, it is not accessible from any other place on Earth. Other places, such as the Andes Mountains in South America, are also high mountains, but they are not the highest mountains in the world. In summary, Mount Everest is the highest mountain in the world, but it is not accessible from anywhere else on Earth. It is the highest mountain range on Earth, but it is not accessible from anywhere else on Earth.
474
 
475
  [Q]: Who proposed the law of universal gravitation?
476
- [A]: The law of universal gravitation was proposed by British physicist Newton. He believed that all objects produce gravitational force, which is determined by the mass of the objects and their distance. This theory reveals the relationship between mass and distance and proves the nature of universal gravitation.
477
- [B]: The law of universal gravitation was discovered by Newton and was proposed by British physicist James Watt in 1879.
478
- [C]: Who proposed the law of universal gravitation?
479
- [D]: The law of universal gravitation was proposed by Newton, who discovered that every object exerts a gravitational force on other objects, and the magnitude of this force depends on the mass and distance of the objects.
480
- [E]: Gravity is a force in the universe composed of mass or energy. According to Newton's law of universal gravitation, the gravitational force between two objects is directly proportional to their mass and inversely proportional to their distance. Therefore, Newton's law of universal gravitation shows that the gravitational force between two objects is directly proportional to their mass and inversely proportional to their distance. This law has wide applications in physics, astronomy, and other fields.
481
- [F]: The law of universal gravitation was proposed by British physicist Isaac Newton in 1687. He published a famous paper titled "Mathematical Principles of Natural Philosophy" in 1687, in which he proposed the law of universal gravitation, stating that there is a gravitational force between any two objects, which is directly proportional to their mass and inversely proportional to the square of their distance. This law, known as the law of universal gravitation, describes the relationship between mass and distance. It shows that the gravitational force between two objects is proportional to their mass and inversely proportional to the square of the distance between them. This law is crucial for explaining many phenomena in the universe, such as planetary motion around the sun, gravitational lensing between stars, etc. It is also widely used in astronomy, physics, and engineering, providing an important theoretical foundation for understanding the nature and evolution of the universe.
482
 
483
  [Q]: Do you know what the main food of giant pandas is?
484
- [A]: The main food of giant pandas is bamboo. They typically eat bamboo and need to consume about 30 kilograms of bamboo daily.
485
- [B]: Giant pandas primarily eat bamboo. Their digestive system is quite specialized, and they need to digest a large amount of bamboo daily to meet their survival needs. Besides bamboo, giant pandas also survive in the wild.
486
- [C]: The main food of giant pandas includes bamboo, mushrooms, beetroot, and mushrooms.
487
- [D]: Giant pandas are herbivorous mammals that mainly eat bamboo, including bamboo shoots, bamboo leaves, and stems.
488
- [E]: The main food of giant pandas is bamboo.
489
- [F]: Giant pandas are herbivores whose main food is bamboo. Their digestive system is adapted to digest high-fiber bamboo, but they also eat some other plants, such as herbaceous plants and small mammals. The digestive system of giant pandas is adapted to high-fiber bamboo, which allows them to digest large amounts of cellulose, helping them survive in the wild. Additionally, giant pandas have an enzyme called "cellulase" in their digestive tract that helps them break down cellulose, aiding in the digestion of a lot of water and nutrients. Their digestive tract also contains a protein called "cellulase protein," which helps maintain their energy levels, supporting their survival in the wild. Overall, the main food of giant pandas is bamboo, but they also consume some other plants, such as herbaceous plants and small mammals.
490
 
491
  [Q]: Why is seawater salty?
492
- [A]: Seawater contains many salts, which can lead to an increase in the salinity of seawater. The salts in seawater primarily come from the salt content in seawater, and these salts can form hydrogen bonds with the seawater, increasing its salinity.
493
- [B]: Seawater contains rich elements like calcium, phosphorus, potassium, and phosphorus, which are limited in quantity and distribution in the ocean. Additionally, seawater contains various nutrients from organisms and microorganisms, such as shells and seaweed.
494
- [C]: The salinity of seawater is higher than that of seawater because of its high carbonate and base levels. The carbonate and base levels in seawater are higher than in seawater, making the seawater salty.
495
- [D]: Seawater contains dissolved salts and other minerals, which come from the erosion and adsorption of the Earth's crust, and these substances dissolve in the water, causing the seawater to taste salty.
496
- [E]: Water is essential for many living organisms as it can absorb and release carbon dioxide, release oxygen, maintain atmospheric water vapor, and regulate the climate and environment. The salts and chemicals in water release oxygen, which allows organisms on Earth to breathe. Additionally, water plays many other roles, such as maintaining the water cycle, regulating the climate, and supporting the environment. Water is essential for many living organisms and is also a fundamental aspect of many natural phenomena.
497
- [F]: Seawater is salty primarily because it contains a large amount of dissolved salts, with sodium chloride (NaCl) being the most prominent. Sodium chloride is a common type of salt found in many places on Earth, including oceans, lakes, rivers, and groundwater. When water evaporates from the ocean, it carries away some of the salts, making the seawater salty. Additionally, seawater contains other dissolved salts, such as sulfate, magnesium sulfate, calcium sulfate, and potassium sulfate.
498
  ```
499
 
500
- 🙋‍♂️Directly throw the answer of the above model to GPT-4o and ask it to help score it:
 
501
 
502
  ---
 
503
 
504
- According to your request, I will evaluate the performance of each model based on accuracy, clarity, and completeness,
505
- and provide ratings and rankings.
506
-
507
- ### Model Performance Ratings and Summary
508
-
509
- **Model A**
510
-
511
- - **Accuracy**: Most answers are accurate, but there are occasional minor errors.
512
- - **Clarity**: Responses are concise and fluent.
513
- - **Completeness**: Responses are sometimes slightly simplistic, but overall the information is sufficient.
514
- - **Score**: 80
515
-
516
- **Model B**
517
-
518
- - **Accuracy**: Most answers are accurate, but there are some minor errors (e.g., regarding the first artificial
519
- satellite).
520
- - **Clarity**: Language is relatively clear, but sometimes expressions are slightly confusing.
521
- - **Completeness**: Responses are fairly comprehensive but contain information discrepancies.
522
- - **Score**: 75
523
-
524
- **Model C**
525
-
526
- - **Accuracy**: Responses are inaccurate, with several instances of self-asking and answering.
527
- - **Clarity**: Language is fluent, but the logic of responses is poor.
528
- - **Completeness**: Information is incomplete and sometimes lacks important details.
529
- - **Score**: 55
530
-
531
- **Model D**
532
-
533
- - **Accuracy**: Most answers are accurate and generally correct.
534
- - **Clarity**: Expression is clear, with appropriate information density.
535
- - **Completeness**: Responses are relatively complete, but some answers might include unnecessary details.
536
- - **Score**: 85
537
-
538
- **Model E**
539
 
540
- - **Accuracy**: Accuracy is lower, with some answers unrelated to the questions.
541
- - **Clarity**: Expression is unclear and can cause confusion.
542
- - **Completeness**: Information is incomplete and sometimes deviates from the topic.
543
- - **Score**: 50
544
 
545
- **Model F**
 
 
546
 
547
- - **Accuracy**: Some answers are inaccurate, with notable errors (e.g., "24 days").
548
- - **Clarity**: Expression is lengthy and can cause confusion.
549
- - **Completeness**: Information is excessively lengthy and repetitive, reducing readability.
550
- - **Score**: 60
551
 
552
- ### Ranking (from highest to lowest):
 
 
553
 
554
- 1. **Model D** - 85
555
- 2. **Model A** - 80
556
- 3. **Model B** - 75
557
- 4. **Model F** - 60
558
- 5. **Model C** - 55
559
- 6. **Model E** - 50
560
 
561
- These scores and rankings are based on each model’s overall performance in accuracy, clarity, and completeness.
 
 
562
 
563
  ---
564
 
565
- ### 👉 Summary of Results
566
 
567
- * The ranking of the minimind series (ABCD) is intuitive. minimind(0.2B) scores the highest, with minimal errors and
568
- hallucinations in answering common-sense questions.
569
- * Surprisingly, minimind-small-T (0.02B) with only 26M parameters performs close to minimind(0.2B).
570
- * minimind(0.2B) had less than 2 epochs of SFT training. Despite the training time being several times that of
571
- 0.02B, the model was terminated early to free up resources for smaller models. Even with insufficient training,
572
- 0.2B achieved the best performance, highlighting the impact of model size.
573
- * minimind-MoE (0.16B) performed poorly, even worse than its dense counterpart minimind(0.05B). This isn't due to
574
- the MoE approach itself but rather because the model was terminated early due to resource constraints. MoE models
575
- typically require more training epochs, and with only 2 epochs, the training was extremely insufficient. A
576
- well-trained MoE version was previously tested on Yi tokenizer and showed visibly better performance compared to
577
- dense models. Further training for updates to v2 and v3 versions will be conducted when server resources are
578
- available.
579
 
580
- * Model F's responses appear the most complete, despite some hallucinations. Both GPT-4o and Kimi's evaluations agree
581
- that it is "overly verbose with repetitive content and contains hallucinations." This evaluation may seem too
582
- harsh—hallucinations accounting for 10 out of 100 words can unfairly lead to a 0 score. Model F, having a default
583
- longer training text and a much larger dataset, provides seemingly more complete answers, with data proving more
584
- crucial than model size in similar contexts.
585
 
586
- > 🙋‍♂️ Personal subjective rating: F > D > A ≈ B > C > E
587
 
588
- > 🤖 GPT-4o rating: D > A > B > F > C > E
589
 
590
- In summary, the scaling law suggests that larger model parameters and more training data generally lead to stronger
591
- model performance.
592
 
593
  # 📌 Objective Dataset: C-Eval
594
 
@@ -599,62 +578,16 @@ four tokens `A`, `B`, `C`, `D`, and choose the one with the highest probability
599
  against the standard answer. Note that minimind models were not trained on larger datasets or fine-tuned for question
600
  answering, so results should be considered as reference only.
601
 
602
- >For example, detailed results for minimind-small:
603
-
604
- | category | Correct/Total | Accuracy |
605
- |----------------------------------------------|---------------|----------|
606
- | probability_and_statistics_val | 3/18 | 16.67% |
607
- | law_val | 5/24 | 20.83% |
608
- | middle_school_biology_val | 4/21 | 19.05% |
609
- | high_school_chemistry_val | 7/19 | 36.84% |
610
- | high_school_physics_val | 5/19 | 26.32% |
611
- | legal_professional_val | 2/23 | 8.70% |
612
- | high_school_chinese_val | 4/19 | 21.05% |
613
- | high_school_history_val | 6/20 | 30.00% |
614
- | tax_accountant_val | 10/49 | 20.41% |
615
- | modern_chinese_history_val | 4/23 | 17.39% |
616
- | middle_school_physics_val | 4/19 | 21.05% |
617
- | middle_school_history_val | 4/22 | 18.18% |
618
- | basic_medicine_val | 1/19 | 5.26% |
619
- | operating_system_val | 3/19 | 15.79% |
620
- | logic_val | 4/22 | 18.18% |
621
- | electrical_engineer_val | 7/37 | 18.92% |
622
- | civil_servant_val | 11/47 | 23.40% |
623
- | chinese_language_and_literature_val | 5/23 | 21.74% |
624
- | college_programming_val | 10/37 | 27.03% |
625
- | accountant_val | 9/49 | 18.37% |
626
- | plant_protection_val | 7/22 | 31.82% |
627
- | middle_school_chemistry_val | 4/20 | 20.00% |
628
- | metrology_engineer_val | 3/24 | 12.50% |
629
- | veterinary_medicine_val | 6/23 | 26.09% |
630
- | marxism_val | 5/19 | 26.32% |
631
- | advanced_mathematics_val | 5/19 | 26.32% |
632
- | high_school_mathematics_val | 4/18 | 22.22% |
633
- | business_administration_val | 8/33 | 24.24% |
634
- | mao_zedong_thought_val | 8/24 | 33.33% |
635
- | ideological_and_moral_cultivation_val | 5/19 | 26.32% |
636
- | college_economics_val | 17/55 | 30.91% |
637
- | professional_tour_guide_val | 10/29 | 34.48% |
638
- | environmental_impact_assessment_engineer_val | 7/31 | 22.58% |
639
- | computer_architecture_val | 6/21 | 28.57% |
640
- | urban_and_rural_planner_val | 11/46 | 23.91% |
641
- | college_physics_val | 5/19 | 26.32% |
642
- | middle_school_mathematics_val | 3/19 | 15.79% |
643
- | high_school_politics_val | 4/19 | 21.05% |
644
- | physician_val | 13/49 | 26.53% |
645
- | college_chemistry_val | 3/24 | 12.50% |
646
- | high_school_biology_val | 5/19 | 26.32% |
647
- | high_school_geography_val | 4/19 | 21.05% |
648
- | middle_school_politics_val | 6/21 | 28.57% |
649
- | clinical_medicine_val | 6/22 | 27.27% |
650
- | computer_network_val | 2/19 | 10.53% |
651
- | sports_science_val | 2/19 | 10.53% |
652
- | art_studies_val | 14/33 | 42.42% |
653
- | teacher_qualification_val | 12/44 | 27.27% |
654
- | discrete_mathematics_val | 6/16 | 37.50% |
655
- | education_science_val | 7/29 | 24.14% |
656
- | fire_engineer_val | 9/31 | 29.03% |
657
- | middle_school_geography_val | 1/12 | 8.33% |
658
 
659
  **Total number of questions**: 1346
660
 
@@ -666,12 +599,11 @@ answering, so results should be considered as reference only.
666
 
667
  #### Results summary:
668
 
669
- | category | correct | question_count | accuracy |
670
- |:-----------------|:--------:|:--------------:|:---------:|
671
- | minimind-small-T | 344 | 1346 | 25.56% |
672
- | minimind-small | 312 | 1346 | 23.18% |
673
- | minimind | 351 | 1346 | 26.08% |
674
- | minimind-moe | 316 | 1346 | 23.48% |
675
 
676
  ### Model Performance Insights from GPT-4o
677
 
@@ -697,22 +629,24 @@ answering, so results should be considered as reference only.
697
  This suggests that the model performs well in logical reasoning, foundational sciences, and some engineering disciplines but is weaker in humanities, social sciences, and certain specialized fields (such as law and taxation). To improve the model's performance, additional training in humanities, physics, law, and environmental science may be beneficial.
698
  ```
699
 
700
-
701
  # 📌 Others
702
 
703
  ### Inference and Export
704
 
705
  * [./export_model.py](./export_model.py) can export the model to the transformers format and push it to Hugging Face.
706
 
707
- * MiniMind's Hugging Face collection address: [MiniMind](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
 
708
 
709
  ---
710
 
711
  ### API Inference
712
 
713
- [./my_openai_api.py](./my_openai_api.py) provides a chat interface for the OpenAI API, making it easier to integrate your model with third-party UIs, such as fastgpt, OpenWebUI, etc.
 
714
 
715
- * Download the model weight files from [Hugging Face](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5):
 
716
  ```
717
  minimind (root dir)
718
  ├─minimind
@@ -758,31 +692,56 @@ This suggests that the model performs well in logical reasoning, foundational sc
758
 
759
  ---
760
 
761
- # 📌 Acknowledge
762
 
763
- If you find this project helpful, feel free to give it a Star 🎉✨.
 
 
 
 
764
 
765
- The document is somewhat lengthy and may contain errors due to my limited proficiency. Please feel free to open issues for discussion or criticism.
766
 
767
- Special thanks to the following open-source projects for their inspiration and datasets:
 
 
 
 
768
 
769
- * [baby-llama2-chinese](https://github.com/DLLXW/baby-llama2-chinese)
770
- * [ChatLM-mini-Chinese](https://github.com/charent/ChatLM-mini-Chinese)
771
- * [Zero-Chatgpt](https://github.com/AI-Study-Han/Zero-Chatgpt/tree/main)
772
 
773
 
774
- ## ✨Top contributors
775
- <a href="https://github.com/jingyaogong/minimind/graphs/contributors">
776
- <img src="https://contrib.rocks/image?repo=jingyaogong/minimind" />
777
- </a>
778
 
779
- # 📌 Statement
 
780
 
781
- This project does not assume responsibility for data security, public opinion risks, or any risks and liabilities arising from model misguidance, misuse, dissemination, or improper use related to open-source models and code.
782
 
 
 
 
 
 
 
 
783
 
 
 
 
 
 
 
 
784
 
 
 
 
 
 
785
 
786
- ## License
787
 
788
  This repository is licensed under the [Apache-2.0 License](LICENSE).
 
1
+ <div align="center">
2
+
3
  ![logo](./images/logo.png)
4
+
5
+ </div>
6
+
7
  <div align="center">
8
 
9
  ![visitors](https://visitor-badge.laobi.icu/badge?page_id=jingyaogong/minimind)
 
26
 
27
  </div>
28
 
 
29
  * This open-source project aims to train a miniature language model **MiniMind** from scratch, with a size of just 26MB.
30
  * **MiniMind** is extremely lightweight, approximately $\frac{1}{7000}$ the size of GPT-3, designed to enable fast
31
  inference and even training on CPUs.
 
36
 
37
  ---
38
 
39
+ <div align="center">
40
+
41
+ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
42
+
43
+ [Bilibili Video](https://www.bilibili.com/video/BV12dHPeqE72/?share_source=copy_web&vd_source=670c2504f88726f8cf4a21ef6147c0e8)
44
+
45
+ </div>
46
+
47
  # 📌 Introduction
48
 
49
  In the field of large language models (LLMs) such as GPT, LLaMA, GLM, etc., while their performance is impressive, the
 
58
  Therefore, the goal of this project is to lower the barrier to entry for working with LLMs as much as possible, by
59
  training an extremely lightweight language model from scratch.
60
 
61
+ > [!CAUTION]
62
+ > As of 2024-09-17, MiniMind has trained three model versions, with the smallest model requiring only 26M (0.02B) parameters to achieve smooth conversational abilities!
63
 
64
+ | Model (Size) | Tokenizer Length | Inference Memory Usage | Release Date | Subjective Rating (/100) |
65
+ |-------------------------------|------------------|------------------------|--------------|--------------------------|
66
+ | minimind-v1-small (26M) | 6400 | 0.5 GB | 2024.08.28 | 50' |
67
+ | minimind-v1-moe (4×26M) | 6400 | 1.0 GB | 2024.09.17 | 55' |
68
+ | MiniMind-V1 (108M) | 6400 | 1.0 GB | 2024.09.01 | 60' |
 
69
 
70
+ > This analysis was run on an RTX 3090 GPU with Torch 2.1.2, CUDA 12.2, and Flash Attention 2.
71
 
72
  The project includes:
73
 
74
  - Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA
75
  fine-tuning, and DPO preference optimization, along with datasets and sources.
76
  - Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
77
+ - Training support for single-GPU and multi-GPU setups(DDP、DeepSpeed). The training process allows for stopping and
78
+ resuming at any
79
  point.
80
  - Code for testing the model on the Ceval dataset.
81
  - Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat
 
83
 
84
  We hope this open-source project helps LLM beginners get started quickly!
85
 
86
+ ### 👉**Recent Updates**
87
+ <details close>
88
+ <summary> <b>2024-09-17 (new🎉)</b> </summary>
89
+
90
+ - Updated the minimind-v1-moe model
91
+ - To prevent ambiguity, all mistral_tokenizer versions have been removed, and a custom minimind_tokenizer is now used as the tokenizer.
92
+
93
+ </details>
94
+
95
+ <details close>
96
+ <summary> <b>2024-09-01</b> </summary>
97
+
98
+ - Updated the MiniMind-V1 (108M) model, using minimind_tokenizer with 3 pre-training epochs and 10 SFT epochs for more thorough training and improved performance.
99
+
100
+ - The project has been deployed to ModelScope's Creative Space and can be experienced on the website:
101
+
102
+ - [ModelScope Online Experience](https://www.modelscope.cn/studios/gongjy/minimind)
103
+
104
+ </details>
105
 
106
  <details close>
107
+ <summary> <b>2024-08-27</b> </summary>
108
+
109
+ - The project was open-sourced for the first time.
110
+
111
  </details>
112
 
113
  # 📌 Environment
 
120
  * CUDA == 12.2
121
  * [requirements.txt](./requirements.txt)
122
 
123
+ # 📌 Quick Inference & Test
124
 
125
  <div align="center" style="font-size: 1.5em; font-weight: bold;">
126
  <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
127
  Hugging Face
128
 
129
+ [MiniMind (HuggingFace)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
130
+
131
+ <img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
132
+
133
+ [MiniMind (ModelScope)](https://www.modelscope.cn/models/gongjy/MiniMind-V1)
134
+
135
  </div>
136
 
137
  ```bash
138
  # step 1
139
+ git clone https://huggingface.co/jingyaogong/minimind-v1
140
  ```
141
 
142
  ```bash
 
144
  python 2-eval.py
145
  ```
146
 
147
+ or you can run streamlit, launch a web page to chat with minimind-v1
148
+
149
+ ```bash
150
+ # or step 3, use streamlit
151
+ streamlit run fast_inference.py
152
+ ```
153
+
154
+ ![](./images/streamlit.png)
155
+
156
+
157
+ <div align="center">
158
+
159
+ The project has been deployed to ModelScope makerspace, where you can experience:
160
+
161
+ [ModelScope Online](https://www.modelscope.cn/studios/gongjy/minimind)
162
+
163
+
164
+ </div>
165
+
166
  # 📌 Quick Start
167
 
168
+ *
169
+ 0. Install the required dependencies
170
+ ```bash
171
+ pip install -r requirements.txt
172
+ ```
173
+
174
  *
175
  1. Clone the project code
176
 
 
181
  *
182
  2. If you need to train the model yourself
183
 
184
+ * 2.1 Download the [dataset download link](#dataset-download-links) and place it in the `./dataset` directory.
185
 
186
  * 2.2 Run `python data_process.py` to process the dataset, such as token-encoding pretrain data and extracting QA
187
  data to CSV files for the SFT dataset.
 
195
  *
196
  3. Test model inference performance
197
 
198
+ * Ensure that the required trained parameter weights are located in the `./out/` directory.
199
+ * You can also directly download and use the trained model weights from [Trained Model Weights](#Trained Model Weights).
200
  ```text
201
  out
202
  ├── multi_chat
 
203
  │   ├── full_sft_512.pth
204
+ │   ├── full_sft_512_moe.pth
205
+ │   └── full_sft_768.pth
206
  ├── single_chat
 
207
  │   ├── full_sft_512.pth
208
+ │   ├── full_sft_512_moe.pth
209
+ │   └── full_sft_768.pth
210
+ ├── pretrain_768.pth
211
+ ├── pretrain_512_moe.pth
212
+ ├── pretrain_512.pth
 
 
 
 
213
  ```
214
 
215
  * Test the pretraining model's chain effect with `python 0-eval_pretrain.py`
 
218
 
219
  🍭 **Tip**: Pretraining and full parameter fine-tuning (`pretrain` and `full_sft`) support DDP multi-GPU acceleration.
220
 
221
+ * Start training on a single machine with N GPUs(DDP)
222
+ ```bash
 
223
  torchrun --nproc_per_node N 1-pretrain.py
224
+ # and
 
 
225
  torchrun --nproc_per_node N 3-full_sft.py
226
  ```
227
+ * Start training on a single machine with N GPUs(DeepSpeed)
228
+ ```bash
229
+ deepspeed --master_port 29500 --num_gpus=N 1-pretrain.py
230
+ # and
231
+ deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
232
+ ```
233
 
234
  # 📌 Data sources
235
 
 
250
 
251
  Powerful open-source models like 01万物, 千问, chatglm, mistral, and Llama3 have the following tokenizer vocabulary
252
  sizes:
253
+ <table>
254
+ <tr><th>Tokenizer Model</th><th>Vocabulary Size</th><th>Come from</th></tr>
255
+ <tr><td>yi tokenizer</td><td>64,000</td><td>01-AI(China)</td></tr>
256
+ <tr><td>qwen2 tokenizer</td><td>151,643</td><td>Alibaba Cloud(China)</td></tr>
257
+ <tr><td>glm tokenizer</td><td>151,329</td><td>Zhipu AI(China)</td></tr>
258
+ <tr><td>mistral tokenizer</td><td>32,000</td><td>Mistral AIChina)</td></tr>
259
+ <tr><td>llama3 tokenizer</td><td>128,000</td><td>Meta(China)</td></tr>
260
+ <tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
261
+ </table>
262
+
263
+ > [!IMPORTANT]
264
+ > Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now use the Minimind_tokenizer for tokenization, and all versions of the Mistral_tokenizer have been deprecated.
265
+
266
+ > Although the Minimind_tokenizer has a small length and its encoding/decoding efficiency is weaker compared to Chinese-friendly tokenizers like Qwen2 and GLM, the Minimind models have opted for their custom-trained Minimind_tokenizer to maintain a lightweight parameter structure and prevent an imbalance between encoding and computation layers. This is because the Minimind vocabulary size is only 6,400.
267
+ > Moreover, Minimind has not encountered any issues with decoding rare words in practical tests, and the performance has been satisfactory. Due to the custom vocabulary being compressed to 6,400 tokens, the total parameter size of the LLM is minimized to only 26M.
 
 
268
 
269
  ---
270
 
271
  - 📙 **[Pretrain Data](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)**:
272
+ The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
273
  is a collection of data from various public sources such as websites, encyclopedias, blogs, open-source code, books,
274
  etc. It has been compiled, cleaned, and organized into a unified JSONL format, with rigorous filtering and
275
  deduplication to ensure data comprehensiveness, scale, reliability, and high quality. The total amount is
 
309
 
310
  ### Dataset Download Links
311
 
312
+ | MiniMind Training Dataset | Download Link |
313
+ |---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
314
+ | **[tokenizer Data]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
315
+ | **[Pretrain Data]** | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
316
+ | **[SFT Data]** | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
317
+ | **[DPO Data]** | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
318
+ | **[DPO Data]** | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |
319
 
320
  # 📌 Model
321
 
 
346
  Model configurations can be found in [./model/LMConfig.py](./model/LMConfig.py). The model types and parameters are
347
  shown in the table below:
348
 
349
+ | Model Name | params | len_vocab | n_layers | d_model | kv_heads | q_heads | share+route | TopK |
350
+ |------------------|--------|-----------|----------|---------|----------|---------|-------------|------|
351
+ | minimind-v1-small | 26M | 6400 | 8 | 512 | 8 | 16 | - | - |
352
+ | minimind-v1-moe | 4×26M | 6400 | 8 | 512 | 8 | 16 | 2+4 | 2 |
353
+ | minimind-v1 | 108M | 6400 | 16 | 768 | 8 | 16 | - | - |
 
354
 
 
 
355
 
356
  # 📌 Experiment
357
 
 
362
  Environment: python 3.9 + Torch 2.1.2 + DDP multi-GPU training
363
  ```
364
 
365
+ | Model Name | params | len_vocab | batch_size | pretrain_time | sft_single_time | sft_multi_time |
366
+ |------------------|--------|-----------|------------|-------------------|-------------------|---------------------|
367
+ | minimind-v1-small | 26M | 6400 | 64 | ≈2 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
368
+ | minimind-v1-moe | 4×26M | 6400 | 40 | ≈6 hour (1 epoch) | ≈5 hour (1 epoch) | ≈1 hour (1 epoch) |
369
+ | minimind-v1 | 108M | 6400 | 16 | ≈6 hour (1 epoch) | ≈4 hour (1 epoch) | ≈1 hour (1 epoch) |
 
370
 
371
  ---
372
 
 
428
  ```bash
429
  python 5-dpo_train.py
430
  ```
 
 
 
 
 
 
 
 
 
 
 
431
  ---
432
+ 📋 Regarding LLM parameter configuration, an interesting paper [MobileLLM](https://arxiv.org/pdf/2402.14905) provides detailed research and experiments.
433
+ The scaling law exhibits unique patterns in small models. The parameters that significantly influence the scaling of Transformer models are primarily `d_model` and `n_layers`.
434
 
435
+ * `d_model`↑ + `n_layers`↓ -> Short and wide models
436
+ * `d_model`↓ + `n_layers`↑ -> Tall and narrow models
 
 
 
 
437
 
438
+ The Scaling Law proposed in 2020 posits that the amount of training data, parameter count, and training iterations are the key factors determining performance, with the influence of model architecture being nearly negligible. However, this law seems not to fully apply to small models.
439
+ MobileLLM suggests that the depth of the architecture is more important than its width. A "deep and narrow" model can learn more abstract concepts compared to a "wide and shallow" model. For instance, when the model parameters are fixed at 125M or 350M, a 30–42 layer "narrow" model significantly outperforms a 12-layer "short and wide" model. This trend is observed across eight benchmark tests, including common sense reasoning, question answering, and reading comprehension.
440
+ This is a fascinating discovery, as previously, few attempts were made to stack more than 12 layers when designing architectures for small models around the 100M parameter range. This aligns with the observations from MiniMind, where adjusting parameters between `d_model` and `n_layers` during training produced similar effects.
441
+ However, "deep and narrow" has its limitations. When `d_model` < 512, the disadvantages of collapsing word embedding dimensions become very pronounced, and increasing layers does not compensate for the shortcomings in `d_head` caused by fixed `q_head`. Conversely, when `d_model` > 1536, increasing layers seems to have a higher priority than `d_model`, providing a better "cost-performance" ratio and effect gain.
442
+ Therefore, MiniMind sets `d_model = 512` and `n_layers = 8` for the small model to achieve a balance between "minimal size <-> better performance." For greater performance gains, `d_model = 768` and `n_layers = 16` are set, aligning better with the scaling law for small models.
443
 
444
+ > For reference, the configuration details for GPT-3 are shown in the table below:
 
 
445
 
446
+ ![gpt3_config.png](./images/gpt3_config.png)
 
 
 
 
447
 
448
+ ---
449
+ ### Trained Model Weights
450
 
 
 
 
 
451
 
452
+ | Model Name | params | Config | pretrain_model | single_sft_model | multi_sft_model |
453
+ |-------------------|--------|-----------------------------|----------------|-----------------------------------------------------------------|----------------------------------------------------------------|
454
+ | minimind-v1-small | 26M | d_model=512<br/>n_layers=8 | - | [URL](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [URL](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) |
455
+ | minimind-v1-moe | 4×26M | d_model=512<br/>n_layers=8 | - | - | - |
456
+ | minimind-v1 | 108M | d_model=768<br/>n_layers=16 | - | [URL](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666) | [URL](https://pan.baidu.com/s/12iHGpAs6R0kqsOnGtgK6vQ?pwd=6666) |
457
 
458
  ---
459
 
460
+
461
  # 📌 Eval
462
 
463
+ > [!TIP]
464
+ > The following tests were completed on September 17, 2024. New models released after this date will not be included in the tests unless there is a special need.
 
 
 
 
465
 
466
+ [A] [minimind-v1-small(0.02B)](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666)<br/>
467
+ [B] minimind-v1-moe(0.1B)<br/>
468
+ [C] [minimind-v1(0.1B)](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666)<br/>
469
+ [D] [baby-llama2-chinese(0.2B)](https://github.com/DLLXW/baby-llama2-chinese)<br/>
470
+ [E] [chatlm-mini-chinese(0.2B)](https://github.com/charent/ChatLM-mini-Chinese)<br/>
471
+
472
+ ```text
473
+ [Q]: Do you know the Yangtze River?
474
+ [A]: Yes, the Yangtze River is the longest river in China, with a length of over 3,300 kilometers.
475
+ [B]: Yes, the Yangtze River is the longest river in China, with a length of about 7,100 kilometers. Its drainage area is about 33.28 million square kilometers, making it one of China's largest rivers and the third longest river in the world, with a total length of about 6,300 kilometers.
476
+ [C]: Yes, the Yangtze River is one of the longest rivers in China and the third longest river in the world. It originates from the Tibetan Plateau, flows through 11 provinces, and eventually empties into the East China Sea. The Yangtze River Basin is one of China's largest river basins, with abundant natural resources and cultural heritage, such as the Yangtze River Bridge, Yangtze River Maritime Museum, and ancient cities along the Yangtze.
477
+ [D]: I also know that China is one of the longest cities in the world, with a population of over 34 million people.
478
+ [E]: Yes, the Yangtze River is the longest river in China and also one of the longest rivers in the world. It originates from the Tibetan Plateau, flows through nine provinces and regions, and eventually empties into the East China Sea. The climate in the Yangtze River Basin is mild, with distinct seasons, making it an important agricultural and industrial base in China. The river basin is rich in aquatic resources and is one of China's major fishing resources.
 
 
 
479
 
480
  [Q]: Which country launched the first artificial satellite?
481
  [A]: The first artificial satellite was Sputnik 1.
482
  [B]: The first artificial satellite was launched by the United States.
483
+ [C]: The first artificial satellite was launched by the Soviet Union on October 4, 1957.
484
+ [D]: This satellite, named "Armstrong," was launched on November 20, 1961, and was the first successful satellite to send a human into space.
485
+ [E]: The first artificial satellite was launched by the Soviet Union on April 4, 1957. It was launched by Soviet astronaut Yuri Gagarin and Soviet engineer Roald Skohley. During the launch, Gagarin and Skohley used a small satellite named "Sputnik 1," which had a diameter of 4.3 meters and weighed 1,051 kilograms. This satellite, known as "Sputnik 1," marked a milestone in the Soviet space race and was the first instance of a human spacewalk in history.
 
486
 
487
  [Q]: Do you know where Mount Tai is located?
488
+ [A]: Mount Tai is located in eastern China and is one of the Five Great Mountains of China.
489
+ [B]: Mount Tai is located in Tai'an City, Shandong Province, and is one of the Five Great Mountains of China.
490
+ [C]: Mount Tai is located in Tai'an City, Shandong Province, China, and is one of the Five Great Mountains of China.
491
+ [D]: As an AI assistant, I cannot access the internet, so I cannot provide information about your location. However, you can search for information or related keywords to find details about your location, or check news and websites for relevant information.
492
+ [E]: Mount Tai is located in Tai'an City, Shandong Province, China, and is one of the Five Great Mountains of China. It is also an important symbol of Chinese culture and history. With an elevation of 1,545 meters, it is one of the most famous tourist destinations in China. Its history dates back to the 5th century BC and was a site for ancient royal sacrifices. Mount Tai is a significant part of Chinese culture, known as the "First of the Five Great Mountains."
 
493
 
494
  [Q]: What is the highest mountain in the world?
495
+ [A]: The highest mountain in the world is Mount Everest, located on the border between Nepal and China, with an elevation of 8,848 meters.
496
+ [B]: Mount Everest is the highest mountain in the world, located in the Himalayas, with an elevation of 8,848 meters.
497
+ [C]: Mount Everest is the highest mountain in the world, with an elevation of 8,848 meters.
498
+ [D]: The highest mountain in the world is Mount Everest, located in the Himalayas, with an elevation of 8,848.86 meters (29,031.69 feet).
499
+ [E]: The highest mountain in the world is Mount Everest, also known as Sagarmatha, located on the border between Nepal and China in the Himalayas, with an elevation of 8,848.86 meters (29,031.69 feet). It is one of the most famous climbing destinations, attracting climbers from around the world. The name Everest comes from Tibetan, meaning "The Goddess’s Peak," and it consists of two peaks, one in Nepal and one in China. Its height is determined through measurements of sea level, as it is the highest point on Earth. Although Everest is the highest mountain, it is not accessible from every place on Earth. Other places, such as the Andes Mountains in South America, are also high but do not have the highest peak. Overall, Everest is the highest mountain, but it is not accessible from every location on Earth.
 
500
 
501
  [Q]: Who proposed the law of universal gravitation?
502
+ [A]: The law of universal gravitation was proposed by the British physicist Newton. He believed that all objects exert gravitational force, which depends on their mass and distance. This theory reveals the relationship between mass and distance and explains the nature of gravitation.
503
+ [B]: The law of universal gravitation was discovered by Newton and proposed by the British physicist James Watt in 1879.
504
+ [C]: The law of universal gravitation was proposed by the British physicist Isaac Newton in the 17th century.
505
+ [D]: Gravitation is a force in the universe composed of mass or energy. According to Newton's law of universal gravitation, the force between two objects is proportional to their mass and inversely proportional to their distance. Thus, Newton's law of universal gravitation shows that the gravitational force between two objects is proportional to their mass and inversely proportional to their distance. This law has wide applications in physics, astronomy, and other fields.
506
+ [E]: The law of universal gravitation was proposed by the British physicist Isaac Newton in 1687. In his famous paper "Philosophiæ Naturalis Principia Mathematica," Newton proposed the law of universal gravitation, stating that any two objects exert a gravitational force on each other, proportional to their masses and inversely proportional to the square of their distance. This law is known as the law of universal gravitation and describes the relationship between mass and distance. It is crucial for explaining many phenomena in the universe, such as planetary orbits and gravitational lensing between stars, and has broad applications in astronomy, physics, and engineering, providing a fundamental theoretical basis for understanding the nature and evolution of the universe.
 
507
 
508
  [Q]: Do you know what the main food of giant pandas is?
509
+ [A]: The main food of giant pandas is bamboo. They typically eat bamboo and need to consume about 30 kilograms of it each day.
510
+ [B]: Giant pandas mainly eat bamboo, which is their primary food. Their digestive system is quite specialized, and they need to digest a large amount of bamboo daily to meet their survival needs. Besides bamboo, giant pandas also survive in the wild.
511
+ [C]: Giant pandas mainly eat bamboo but occasionally consume other plants and animals. They also eat some wild fruits, small animals, and plants, but these are not their main food sources.
512
+ [D]: The main food of giant pandas is bamboo.
513
+ [E]: Giant pandas are herbivores that primarily eat bamboo. Their digestive system is adapted to process the high cellulose content of bamboo, but they also consume some other plants, such as herbaceous plants and small mammals. Their digestive system includes enzymes like "cellulase" to break down cellulose, helping them digest a large amount of fiber and nutrients. Additionally, they have "cellulose protein" that helps maintain their energy levels, aiding their survival in the wild. Overall, bamboo is their primary food, but they also eat other plants and small mammals.
 
514
 
515
  [Q]: Why is seawater salty?
516
+ [A]: Seawater contains many salts, which increase its salinity. The salts in seawater mainly come from the dissolved salts and form hydrogen bonds with water molecules, increasing the salinity.
517
+ [B]: Seawater contains abundant elements like calcium, phosphorus, potassium, and phosphorus, which are limited in quantity and distribution in the ocean. Additionally, seawater contains nutrients from various organisms and microorganisms, such as shells and seaweeds.
518
+ [C]: Seawater contains many dissolved salts, with the most common being sodium chloride. When salt is added to water, its molecules react with water molecules to form sodium hydroxide, which is a type of salt.
519
+ [D]: Water is essential for many organisms, as it can absorb and release carbon dioxide, release oxygen,
 
 
520
  ```
521
 
522
+ > [!NOTE]
523
+ > 🙋‍♂️Directly throw the answer of the above model to GPT-4o and ask it to help score it:
524
 
525
  ---
526
+ ### Model Performance Review:
527
 
528
+ 1. **Model A**:
529
+ - **Performance**: Model A's responses are usually concise and clear but lack detail and accuracy in some cases. For example, Model A provided incorrect information about the length of the Yangtze River.
530
+ - **Score**: 60
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
531
 
532
+ 2. **Model B**:
533
+ - **Performance**: Model B provides additional information in some cases, but this information can sometimes be inaccurate or excessive. For instance, Model B gave incorrect figures for the length and drainage area of the Yangtze River.
534
+ - **Score**: 65
 
535
 
536
+ 3. **Model C**:
537
+ - **Performance**: Model C typically provides detailed and accurate answers for most questions. For example, responses about the Yangtze River and Mount Tai were accurate.
538
+ - **Score**: 75
539
 
540
+ 4. **Model D**:
541
+ - **Performance**: Model D’s responses sometimes appear disorganized and lack accuracy. For example, the answer about Mount Tai was completely off-topic.
542
+ - **Score**: 50
 
543
 
544
+ 5. **Model E**:
545
+ - **Performance**: Model E’s responses are usually very detailed, but they can be overly verbose and contain unnecessary information. For instance, the answer on gravity was overly complex.
546
+ - **Score**: 70
547
 
548
+ #### Ranking (from highest to lowest):
 
 
 
 
 
549
 
550
+ | Model | C | E | B | A | D |
551
+ |-------|----|----|----|----|----|
552
+ | Score | 75 | 70 | 65 | 60 | 50 |
553
 
554
  ---
555
 
556
+ ## 👉 Summary of Effects
557
 
558
+ * The ranking of the minimind series (ABC) is intuitive, with minimind-v1(0.1B) scoring the highest and providing mostly accurate answers to common knowledge questions.
559
+ * Surprisingly, minimind-v1-small (0.02B) with only 26M parameters performs close to minimind-v1(0.1B).
560
+ * Despite having less than 2 epochs of training, minimind-v1(0.1B) performed the best. This suggests that a larger model often yields better performance, even with limited training.
561
+ * minimind-v1-moe (0.1B) performed poorly, likely because it was terminated early to free up resources for smaller models. MoE models require more training epochs, and with only 2 epochs, it was under-trained. Previous experiments with a fully trained MoE model on Yi tokenizer showed visible improvements. Future versions, v2 and v3, will be updated with better training.
 
 
 
 
 
 
 
 
562
 
563
+ * Model E’s responses appear the most complete, despite some instances of hallucination and overly verbose content. However, GPT-4o and Deepseek's evaluations suggest it is "overly verbose and repetitive, with some hallucinations."
564
+ This strict evaluation might penalize models with some hallucinations heavily. Due to F models having longer default text lengths and much larger datasets, the quality of responses depends significantly on the data rather than the model size alone.
 
 
 
565
 
566
+ > 🙋‍♂️ Personal Subjective Evaluation: E>C>B≈A>D
567
 
568
+ > 🤖 GPT-4o Evaluation: C>E>B>A>D
569
 
570
+ Scaling Law: Larger model parameters and more training data generally lead to better model performance.
 
571
 
572
  # 📌 Objective Dataset: C-Eval
573
 
 
578
  against the standard answer. Note that minimind models were not trained on larger datasets or fine-tuned for question
579
  answering, so results should be considered as reference only.
580
 
581
+ > For example, detailed results for minimind-small:
582
+
583
+ | Type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
584
+ |------|----------------------------|-----|-----------------------|-----------------------|---------------------|--------------------|---------------------|---------------------|----------------|------------------------|-----------------------|-----------------------|----------------|------------------|-------|---------------------|---------------|---------------------------------|---------------------|------------|------------------|-------------------------|--------------------|---------------------|---------|----------------------|-------------------------|-------------------------|--------------------|-----------------------------------|-------------------|-------------------------|------------------------------------------|-----------------------|-------------------------|-----------------|---------------------------|----------------------|-----------|-------------------|---------------------|-----------------------|------------------------|-------------------|------------------|----------------|-------------|-----------------------|----------------------|-------------------|---------------|-------------------------|
585
+ | Data | probability_and_statistics | law | middle_school_biology | high_school_chemistry | high_school_physics | legal_professional | high_school_chinese | high_school_history | tax_accountant | modern_chinese_history | middle_school_physics | middle_school_history | basic_medicine | operating_system | logic | electrical_engineer | civil_servant | chinese_language_and_literature | college_programming | accountant | plant_protection | middle_school_chemistry | metrology_engineer | veterinary_medicine | marxism | advanced_mathematics | high_school_mathematics | business_administration | mao_zedong_thought | ideological_and_moral_cultivation | college_economics | professional_tour_guide | environmental_impact_assessment_engineer | computer_architecture | urban_and_rural_planner | college_physics | middle_school_mathematics | high_school_politics | physician | college_chemistry | high_school_biology | high_school_geography | middle_school_politics | clinical_medicine | computer_network | sports_science | art_studies | teacher_qualification | discrete_mathematics | education_science | fire_engineer | middle_school_geography |
586
+
587
+ | Type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
588
+ |----------|--------|--------|--------|--------|--------|-------|--------|--------|--------|--------|--------|--------|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------|
589
+ | T/A | 3/18 | 5/24 | 4/21 | 7/19 | 5/19 | 2/23 | 4/19 | 6/20 | 10/49 | 4/23 | 4/19 | 4/22 | 1/19 | 3/19 | 4/22 | 7/37 | 11/47 | 5/23 | 10/37 | 9/49 | 7/22 | 4/20 | 3/24 | 6/23 | 5/19 | 5/19 | 4/18 | 8/33 | 8/24 | 5/19 | 17/55 | 10/29 | 7/31 | 6/21 | 11/46 | 5/19 | 3/19 | 4/19 | 13/49 | 3/24 | 5/19 | 4/19 | 6/21 | 6/22 | 2/19 | 2/19 | 14/33 | 12/44 | 6/16 | 7/29 | 9/31 | 1/12 |
590
+ | Accuracy | 16.67% | 20.83% | 19.05% | 36.84% | 26.32% | 8.70% | 21.05% | 30.00% | 20.41% | 17.39% | 21.05% | 18.18% | 5.26% | 15.79% | 18.18% | 18.92% | 23.40% | 21.74% | 27.03% | 18.37% | 31.82% | 20.00% | 12.50% | 26.09% | 26.32% | 26.32% | 22.22% | 24.24% | 33.33% | 26.32% | 30.91% | 34.48% | 22.58% | 28.57% | 23.91% | 26.32% | 15.79% | 21.05% | 26.53% | 12.50% | 26.32% | 21.05% | 28.57% | 27.27% | 10.53% | 10.53% | 42.42% | 27.27% | 37.50% | 24.14% | 29.03% | 8.33% |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
591
 
592
  **Total number of questions**: 1346
593
 
 
599
 
600
  #### Results summary:
601
 
602
+ | category | correct | question_count | accuracy |
603
+ |:------------------|:--------:|:--------------:|:--------:|
604
+ | minimind-v1-small | 344 | 1346 | 25.56% |
605
+ | minimind-v1 | 351 | 1346 | 26.08% |
606
+
 
607
 
608
  ### Model Performance Insights from GPT-4o
609
 
 
629
  This suggests that the model performs well in logical reasoning, foundational sciences, and some engineering disciplines but is weaker in humanities, social sciences, and certain specialized fields (such as law and taxation). To improve the model's performance, additional training in humanities, physics, law, and environmental science may be beneficial.
630
  ```
631
 
 
632
  # 📌 Others
633
 
634
  ### Inference and Export
635
 
636
  * [./export_model.py](./export_model.py) can export the model to the transformers format and push it to Hugging Face.
637
 
638
+ * MiniMind's Hugging Face collection
639
+ address: [MiniMind](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
640
 
641
  ---
642
 
643
  ### API Inference
644
 
645
+ [./my_openai_api.py](./my_openai_api.py) provides a chat interface for the OpenAI API, making it easier to integrate
646
+ your model with third-party UIs, such as fastgpt, OpenWebUI, etc.
647
 
648
+ * Download the model weight files
649
+ from [Hugging Face](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5):
650
  ```
651
  minimind (root dir)
652
  ├─minimind
 
692
 
693
  ---
694
 
695
+ # 📌 Acknowledgement
696
 
697
+ > [!NOTE]
698
+ > If you find `MiniMind` helpful, please give us a ⭐️ on GitHub. Your support is the driving force behind our continuous
699
+ > efforts to improve the project! Due to the length and limited expertise, there may be some errors. We welcome any
700
+ > issues
701
+ > for discussion and correction.
702
 
703
+ ## 🤝[Contributors](https://github.com/jingyaogong/minimind/graphs/contributors)
704
 
705
+ <!--
706
+ <a href="https://github.com/jingyaogong/minimind/graphs/contributors">
707
+ <img src="https://contrib.rocks/image?repo=jingyaogong/minimind&v3" />
708
+ </a>
709
+ -->
710
 
711
+ <a href="https://github.com/jingyaogong"><img src="https://avatars.githubusercontent.com/u/62287848" width="70px" height="70px"/></a>&nbsp;
712
+ <a href="https://github.com/MuWinds"><img src="https://avatars.githubusercontent.com/u/93832089" width="70px" height="70px"/></a>&nbsp;
713
+ <a href="https://github.com/chuanzhubin"><img src="https://avatars.githubusercontent.com/u/2813798" width="70px" height="70px"/></a>&nbsp;
714
 
715
 
716
+ ## 😊Thanks for
 
 
 
717
 
718
+ <a href="https://github.com/ipfgao"><b>@ipfgao</b></a>:
719
+ <a href="https://github.com/jingyaogong/minimind/issues/26">🔗训练步骤记录</a>
720
 
721
+ ## 🫶Supporter
722
 
723
+ <a href="https://github.com/jingyaogong/minimind/stargazers">
724
+ <picture>
725
+ <source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/stars/dark/jingyaogong/minimind"/>
726
+ <source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/stars/jingyaogong/minimind"/>
727
+ <img alt="github contribution grid snake animation" src="https://reporoster.com/stars/jingyaogong/minimind"/>
728
+ </picture>
729
+ </a>
730
 
731
+ <a href="https://github.com/jingyaogong/minimind/network/members">
732
+ <picture>
733
+ <source media="(prefers-color-scheme: dark)" srcset="https://reporoster.com/forks/dark/jingyaogong/minimind"/>
734
+ <source media="(prefers-color-scheme: light)" srcset="https://reporoster.com/forks/jingyaogong/minimind"/>
735
+ <img alt="github contribution grid snake animation" src="https://reporoster.com/forks/jingyaogong/minimind"/>
736
+ </picture>
737
+ </a>
738
 
739
+ <picture>
740
+ <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date&theme=dark"/>
741
+ <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date"/>
742
+ <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date"/>
743
+ </picture>
744
 
745
+ # License
746
 
747
  This repository is licensed under the [Apache-2.0 License](LICENSE).