instruction-pretrain
committed on
Update README.md
README.md
CHANGED
@@ -15,6 +15,7 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
15 |   </p>
16 |
17 |   **************************** **Updates** ****************************
18 |   * 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
19 |   <p align='left'>
20 |   <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
@@ -22,8 +23,9 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
22 |   * 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)
23 |
24 |   ## Resources
25 | - **🤗 We share our data and models with example usages, feel free to open any issues or discussions! 🤗**
26 |
27 |   - Context-Based Instruction Synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
28 |   - Fine-Tuning Data for the Synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
29 |   - General Models Pre-Trained from Scratch (on 100B tokens):
@@ -43,9 +45,12 @@ We conduct multitask fine-tuning on a language model to develop an instruction s
43 |   <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
44 |   </p>
45 |
46 | - ### Basic Usage: Synthesize instruction-response pairs based on a given raw text
47 |
48 |   **Here is an amazing demo that implements our approach: [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer)**
49 |   ```python
50 |   from transformers import AutoModelForCausalLM, AutoTokenizer
51 |
@@ -98,11 +103,15 @@ print(f'# Context:\n{context}\n')
98 |   for index, pair in enumerate(instruction_response_pairs):
99 |       print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
100 |   ```
101 |
102 | - ### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
103 |   We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
104 |
105 | -
106 |
107 |   ```bash
108 |   git clone https://github.com/microsoft/LMOps.git
@@ -115,7 +124,7 @@ Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/get
115 |   pip install vllm
116 |   ```
117 |
118 | - 2. Synthesize and Templify Few-shot Examples for Pre-Training
119 |
120 |   A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.
121 |
@@ -168,13 +177,20 @@ for idx, text in enumerate(instruction_augmented_texts):
168 |
169 |   # Now you can use `instruction_augmented_texts` for pre-training!
170 |   ```
171 |
172 |   **Pre-Training Suggestions:**
173 |
174 | - Except for the pre-training data, *Instruction Pre-Training* keeps all other
175 |
176 |   1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
177 | - 2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens).
178 |
179 |   ## Citation
180 |   If you find our work helpful, please cite us:
@@ -189,7 +205,7 @@ Instruction Pre-Training
189 |   }
190 |   ```
191 |
192 | - [
193 |   ```bibtex
194 |   @inproceedings{
195 |   cheng2024adapting,

15 |   </p>
16 |
17 |   **************************** **Updates** ****************************
18 | + * 2024/7/31: Updated pre-training suggestions in the `Advanced Usage` section of [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
19 |   * 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
20 |   <p align='left'>
21 |   <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">

23 |   * 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)
24 |
25 |   ## Resources
26 | + **🤗 We share our data and models with example usages, feel free to open any issues or discussions at [this page](https://huggingface.co/papers/2406.14491)! 🤗**
27 |
28 | + - Thanks to the demo [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) for implementing our approach
29 |   - Context-Based Instruction Synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
30 |   - Fine-Tuning Data for the Synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
31 |   - General Models Pre-Trained from Scratch (on 100B tokens):

45 |   <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
46 |   </p>
47 |
48 | + ### 1. Basic Usage: Synthesize instruction-response pairs based on a given raw text
49 |
50 |   **Here is an amazing demo that implements our approach: [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer)**
51 | + <details>
52 | + <summary> Click to expand </summary>
53 | +
54 |   ```python
55 |   from transformers import AutoModelForCausalLM, AutoTokenizer
56 |

103 |   for index, pair in enumerate(instruction_response_pairs):
104 |       print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
105 |   ```
106 | + </details>
107 |
108 | + ### 2. Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
109 |   We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
110 |
111 | + <details>
112 | + <summary> Click to expand </summary>
113 | +
114 | + 1). Set up dependencies:
115 |
116 |   ```bash
117 |   git clone https://github.com/microsoft/LMOps.git

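The Advanced Usage text in the hunk above quotes about 2 days on a single A100-80GB GPU per 1 billion raw tokens. A minimal sketch for turning that figure into a rough budget for other corpus sizes follows. It assumes cost grows linearly with token count and divides evenly across GPUs, which the README does not state; `estimate_synthesis_days` is a hypothetical helper, not part of the LMOps code.

```python
# Back-of-the-envelope estimate based on the README's approximate figure:
# about 2 days on one A100-80GB GPU per 1 billion raw tokens.
DAYS_PER_BILLION_TOKENS = 2.0


def estimate_synthesis_days(raw_tokens: float, num_gpus: int = 1) -> float:
    """Estimate wall-clock days, assuming linear scaling (an assumption)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    gpu_days = raw_tokens / 1e9 * DAYS_PER_BILLION_TOKENS
    return gpu_days / num_gpus


if __name__ == "__main__":
    # Example: a 10B-token corpus sharded over 8 GPUs.
    print(estimate_synthesis_days(10e9, num_gpus=8))
```

Real throughput depends on text length, batch settings, and hardware, so treat the result as an order-of-magnitude guide only.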
124 |   pip install vllm
125 |   ```
126 |
127 | + 2). Synthesize and Templify Few-shot Examples for Pre-Training
128 |
129 |   A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.
130 |

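The description in the hunk above (a one-shot example is a raw text followed by its instruction-response pairs, and few-shot examples chain several of them) can be sketched as follows. The plain-text layout and the helper names `one_shot` and `few_shot` are assumptions for illustration; the actual templates live in the LMOps repository and are not shown in this diff. The sketch covers only the final concatenation, not the multi-round inference itself. The `"Q"` and `"A"` keys match the pair format printed in the Basic Usage snippet.

```python
from typing import Dict, List

# One instruction-response pair, keyed as in the Basic Usage output.
Pair = Dict[str, str]


def one_shot(raw_text: str, pairs: List[Pair]) -> str:
    """Raw text first, then its pairs (hypothetical plain-text layout)."""
    qa_lines = "\n".join(f"{p['Q']} {p['A']}" for p in pairs)
    return f"{raw_text}\n{qa_lines}"


def few_shot(examples: List[str], shots: int) -> List[str]:
    """Join every `shots` consecutive one-shot examples into one training text."""
    if shots < 1:
        raise ValueError("shots must be at least 1")
    return [
        "\n\n".join(examples[i:i + shots])
        for i in range(0, len(examples), shots)
    ]


if __name__ == "__main__":
    shots_1 = one_shot("Text about cats.", [{"Q": "What is it about?", "A": "Cats."}])
    shots_2 = one_shot("Text about dogs.", [{"Q": "What is it about?", "A": "Dogs."}])
    for text in few_shot([shots_1, shots_2], shots=2):
        print(text)
```

A trailing group shorter than `shots` is kept as-is here; dropping it instead is an equally reasonable choice.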
177 |
178 |   # Now you can use `instruction_augmented_texts` for pre-training!
179 |   ```
180 | + </details>
181 |
182 |   **Pre-Training Suggestions:**
183 |
184 | + Except for the pre-training data, *Instruction Pre-Training* keeps all other settings the same as *Vanilla Pre-Training*.
185 | +
186 | + Therefore, you can easily use any training framework, such as [OLMo](https://github.com/allenai/OLMo) (for pre-training from scratch) and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) (for continual pre-training), to train on the templified instruction-augmented corpora.
187 |
188 |   1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
189 | + 2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens). Each example from OpenOrca is formulated as "{question} {response}", with a blank space used to connect the question and response.
190 | +
191 | + Let's try our method in continual pre-training for a quick start; it works easily!
192 | +
193 | + Feel free to ask for any suggestions at [this page](https://huggingface.co/papers/2406.14491); we will reply ASAP 🤗!
194 |
195 |   ## Citation
196 |   If you find our work helpful, please cite us:

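The second suggestion in the hunk above (OpenOrca examples joined as "{question} {response}" and mixed at a 1:1 token ratio) can be sketched as follows. The helper names, the greedy stopping rule, and the whitespace token counter are assumptions for illustration; in practice the count would come from the tokenizer of the model being trained, and the mixed list would be shuffled before training.

```python
from typing import Callable, Iterable, List


def format_openorca(question: str, response: str) -> str:
    """Join question and response with a single blank space."""
    return f"{question} {response}"


def mix_one_to_one(
    augmented_texts: Iterable[str],
    general_texts: Iterable[str],
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Append general texts until their token total reaches the augmented total."""
    mixed = list(augmented_texts)
    budget = sum(count_tokens(text) for text in mixed)
    used = 0
    for text in general_texts:
        if used >= budget:
            break
        mixed.append(text)
        used += count_tokens(text)
    return mixed


if __name__ == "__main__":
    # Whitespace splitting is a stand-in for a real tokenizer.
    def count(text: str) -> int:
        return len(text.split())

    augmented = ["raw text followed by its instruction response pairs"]
    general = [
        format_openorca("What is 2 + 2?", "4"),
        format_openorca("Name a color.", "Blue"),
    ]
    print(mix_one_to_one(augmented, general, count))
```

Because whole examples are appended, the general side can overshoot the budget by at most one example, which is negligible at corpus scale.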
205 |   }
206 |   ```
207 |
208 | + [Adapt LLM to Domains](https://huggingface.co/papers/2309.09530)
209 |   ```bibtex
210 |   @inproceedings{
211 |   cheng2024adapting,