hamel committed on
Commit ed70a08 · unverified · 1 Parent(s): 0cfdb2c

add docs for `input_output` format (#1367) [skip ci]

Files changed (2)
  1. README.md +9 -0
  2. docs/input_output.md +260 -0
README.md CHANGED
@@ -385,6 +385,15 @@ pretraining_dataset: # hf path only
385
 
386
  </details>
387
 
388
+ ##### Template-Free
389
+
390
+ - `input_output`: template-free prompt construction
391
+ ```json
392
+ {"segments": [{"label": true|false, "text": "..."}]}
393
+ ```
394
+
395
+ This is a special format that allows you to construct prompts without using templates. This is for advanced users who want more freedom with prompt construction. See [these docs](docs/input_output.md) for more details.
396
+
397
  ##### Conversation
398
 
399
  - `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
docs/input_output.md ADDED
@@ -0,0 +1,260 @@
1
+ # Template-free prompt construction with the `input_output` format
2
+
3
+ <!-- TOC -->
4
+
5
+ - [Background](#background)
6
+ - [Masking Inputs](#masking-inputs)
7
+ - [You may not want prompt templates](#you-may-not-want-prompt-templates)
8
+ - [The `input_output` format](#the-input_output-format)
9
+ - [Usage](#usage)
10
+ - [1. Prepare Data](#1-prepare-data)
11
+ - [2. Use `type: input_output`](#2-use-type-input_output)
12
+ - [3. Check the prompts](#3-check-the-prompts)
13
+
14
+ <!-- /TOC -->
15
+
16
+ <a id="markdown-background" name="background"></a>
17
+
18
+ ## Background
19
+
20
+ <a id="markdown-masking-inputs" name="masking-inputs"></a>
21
+
22
+ ### Masking Inputs
23
+
24
+ One of the most popular features of
25
+ [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
26
+ setting the following configuration value:
27
+
28
+
29
+ ```yaml
30
+ train_on_inputs: false
31
+ ```
32
+
33
+ If you declare a [dataset format](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset)
34
+ such as `alpaca` or `chatml`, axolotl knows what is an input
35
+ (i.e. human) vs. an output (i.e. the assistant) and masks the input
36
+ labels so that your model can focus on predicting the outputs only.
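
As a rough illustration of what masking means here (a minimal sketch, not axolotl's actual implementation): the labels start as a copy of the token ids, and the input positions are overwritten with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The function name `mask_inputs` and the token ids below are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_inputs(input_ids, num_input_tokens, train_on_inputs=False):
    """Return labels for a sequence made of an input followed by an output.

    The first `num_input_tokens` ids are treated as the input (prompt);
    everything after them is treated as the output.
    """
    labels = list(input_ids)
    if not train_on_inputs:
        for i in range(min(num_input_tokens, len(labels))):
            labels[i] = IGNORE_INDEX
    return labels


# Hypothetical ids: three input tokens followed by two output tokens.
ids = [1, 500, 501, 900, 2]
print(mask_inputs(ids, num_input_tokens=3))  # [-100, -100, -100, 900, 2]
```

With `train_on_inputs=True` the labels are simply the ids, so every token contributes to the loss.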
37
+
38
+ <a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>
39
+
40
+ ### You may not want prompt templates
41
+
42
+ However, there are many situations where you don't want to use one of
43
+ these formats or templates (I usually don't!). This is because they can:
44
+
45
+ - Add unnecessary boilerplate to your prompts.
46
+ - Create artifacts like special delimiters `<|im_start|>` that can
47
+ quickly become footguns if you don't include them correctly at
48
+ inference time.
49
+ - Enforce a *chat* interface when you do not want one. Sometimes you
50
+ just want to fine-tune a model to a very specific task and do NOT
51
+ want multi-turn conversations, roles, etc.
52
+ - Limit you to only certain roles that the template allows.
53
+
54
+ <a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>
55
+
56
+ ### The `input_output` format
57
+
58
+ To construct your prompts without a template, use the
59
+ `input_output` format by setting `type: input_output` in your
60
+ configuration file like this:
61
+
62
+ **config.yml**
63
+
64
+ ```yaml
65
+ train_on_inputs: false # Mask segments of your data
66
+ datasets:
67
+   - path: output.jsonl
68
+     type: input_output # use template-free prompt construction
69
+ ```
70
+
71
+ Unlike `type: completion`, which is also template-free,
72
+ `type: input_output` allows you to mask segments of your text. More
73
+ details on how this works are described below.
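
To make the difference from `type: completion` concrete, here is a hedged sketch of per-segment masking. Each row is a list of segments with a `text` and a boolean `label`; the texts are tokenized and concatenated in order, and segments with `label: false` get `-100` labels. The names `toy_encode` and `build_example` are hypothetical, and the character-level "tokenizer" is only a stand-in for a real one; this is not axolotl's code.

```python
IGNORE_INDEX = -100


def toy_encode(text):
    """Stand-in tokenizer: one id per character (a real tokenizer differs)."""
    return [ord(ch) for ch in text]


def build_example(segments, encode=toy_encode, train_on_inputs=False):
    """Concatenate segments as-is and mask the ones labeled False."""
    input_ids, labels = [], []
    for seg in segments:
        ids = encode(seg["text"])
        input_ids.extend(ids)
        keep = seg["label"] or train_on_inputs
        labels.extend(ids if keep else [IGNORE_INDEX] * len(ids))
    return {"input_ids": input_ids, "labels": labels}


example = build_example([
    {"label": False, "text": "Q: "},  # masked: three characters
    {"label": True, "text": "A"},     # trained on
])
print(example["labels"])  # [-100, -100, -100, 65]
```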
74
+
75
+ <a id="markdown-usage" name="usage"></a>
76
+
77
+ ## Usage
78
+
79
+ This is how you can use the `input_output` format:
80
+
81
+ <a id="markdown-1-prepare-data" name="1-prepare-data"></a>
82
+
83
+ ### 1. Prepare Data
84
+
85
+ To use the `input_output` format, collect your data in the following
86
+ format into a jsonl file (below is the first row from the file
87
+ `output.jsonl`, pretty-printed):
88
+
89
+ ```bash
90
+ $ head -n1 output.jsonl | python -m json.tool
91
+
92
+
93
+ {
94
+ "segments": [
95
+ {
96
+ "label": true,
97
+ "text": "<s>Hello\n"
98
+ },
99
+ {
100
+ "label": true,
101
+ "text": "hi there!. "
102
+ },
103
+ {
104
+ "label": false,
105
+ "text": "goodbye "
106
+ },
107
+ {
108
+ "label": true,
109
+ "text": "farewell</s>"
110
+ }
111
+ ]
112
+ }
113
+ ```
114
+
115
+ Set `"label": false` when you want to mask a segment of text so that the
116
+ model isn't trained on it. Some things to keep in mind:
117
+
118
+ > [!IMPORTANT]
119
+ > 1. **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
120
+ concatenates all the segments as-is.** The tokenizer doesn't add
121
+ anything additional. Notice how I added spaces, newlines, `<s>`
122
+ (BOS), and `</s>` (EOS) myself.
123
+ > 2. Make sure you check the materialized output to validate that the
124
+ prompt is getting assembled how you like.
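
One way to produce such a file is sketched below, assuming you already have your rows in memory. The helper name `write_rows` and its type checks are illustrative assumptions, not part of axolotl; the only thing taken from the format above is the `segments` / `label` / `text` structure.

```python
import json


def write_rows(rows, path):
    """Write one JSON object per line, checking the expected keys first."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            for seg in row["segments"]:
                if not isinstance(seg["label"], bool) or not isinstance(seg["text"], str):
                    raise ValueError(f"bad segment: {seg!r}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# BOS, EOS, spaces, and newlines are written explicitly, as noted above.
rows = [{
    "segments": [
        {"label": True, "text": "<s>Hello\n"},
        {"label": True, "text": "hi there!. "},
        {"label": False, "text": "goodbye "},
        {"label": True, "text": "farewell</s>"},
    ]
}]
write_rows(rows, "output.jsonl")
```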
125
+
126
+ <a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>
127
+
128
+ ### 2. Use `type: input_output`
129
+
130
+ Let's materialize data with our `output.jsonl` file by setting
131
+ `type: input_output` in our axolotl config:
132
+
133
+ ```yaml
134
+ # training_config.yaml
135
+ base_model: mistralai/Mistral-7B-v0.1
136
+ data_seed: 49
137
+ seed: 49
138
+
139
+ datasets:
140
+   - path: output.jsonl
141
+     type: input_output
142
+ val_set_size: 0.1
143
+
144
+ sequence_len: 896
145
+ sample_packing: false
146
+
147
+ micro_batch_size: 2
148
+ gradient_accumulation_steps: 3
149
+ eval_batch_size: 2
150
+ num_epochs: 1
151
+ learning_rate: 0.0002
152
+
153
+ train_on_inputs: false
154
+ special_tokens:
155
+   bos_token: "<s>"
156
+   eos_token: "</s>"
157
+   unk_token: "<unk>"
158
+ ```
159
+
160
+ You can use the following command to materialize your data. The
161
+ `--debug` flag will print the tokens, along with the labels so you can
162
+ verify that the correct items are being ignored:
163
+
164
+ ```bash
165
+ $ python -m axolotl.cli.preprocess training_config.yaml --debug
166
+
167
+ ...
168
+ [2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
169
+ (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
170
+
171
+ ```
172
+
173
+ The format is `decoded_token`(`label`, `token_id`), for example,
174
+ `<s>(1, 1)` means that the token is `<s>`, the label is `1` and the
175
+ token_id is `1`. When the label is `-100` then that token is ignored for
176
+ training.
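
If the debug output is long, a small parser can pull out the masked tokens for you. This is a sketch that assumes the `decoded_token(label, token_id)` layout described above and that the log prefix (timestamp, PID, rank) has already been stripped; `parse_debug_line` is a hypothetical helper, not an axolotl function.

```python
import re

# Matches `token(label, token_id)`; the token part may be empty.
PAIR = re.compile(r"(.*?)\((-?\d+), (\d+)\)\s?")


def parse_debug_line(line):
    """Return a list of (token, label, token_id) tuples."""
    return [(tok, int(label), int(tok_id)) for tok, label, tok_id in PAIR.findall(line)]


sample = "good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111)"
parsed = parse_debug_line(sample)
masked = [tok for tok, label, _ in parsed if label == -100]
print(masked)  # ['good', 'bye', '']
```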
177
+
178
+ <a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>
179
+
180
+ ### 3. Check the prompts
181
+
182
+ Here is another way to check the materialized output:
183
+
184
+ ```python
185
+ from transformers import AutoTokenizer
186
+ from datasets import load_from_disk
187
+ import yaml
188
+
189
+ directory = !ls last_run_prepared/  # IPython/Jupyter shell syntax; not valid in plain Python
190
+ with open('training_config.yaml', 'r') as f:
191
+     cfg = yaml.safe_load(f)
192
+ model_id = cfg['base_model']
193
+ tok = AutoTokenizer.from_pretrained(model_id)
194
+ ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
195
+ ```
196
+
197
+ ```python
198
+ >>> row = ds[0]
199
+ >>> print(tok.decode(row['input_ids']))
200
+ <s> Hello
201
+ hi there!. goodbye farewell</s>
202
+ ```
203
+
204
+ We can check that the right tokens are ignored by comparing the labels
205
+ to each token:
206
+
207
+ ```python
208
+ import pandas as pd
209
+ pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id':i} for i,l in
210
+ zip(row['input_ids'], row['labels'])])
211
+ ```
212
+
213
+ |    | token   | label | id    |
214
+ |----|---------|-------|-------|
215
+ | 0  | \<s\>   | 1     | 1     |
216
+ | 1  | Hello   | 22557 | 22557 |
217
+ | 2  | \\n     | 13    | 13    |
218
+ | 3  | hi      | 12014 | 12014 |
219
+ | 4  | there   | 736   | 736   |
220
+ | 5  | !       | 28808 | 28808 |
221
+ | 6  | .       | 28723 | 28723 |
222
+ | 7  |         | 28705 | 28705 |
223
+ | 8  | good    | -100  | 1179  |
224
+ | 9  | bye     | -100  | 17664 |
225
+ | 10 |         | -100  | 28705 |
226
+ | 11 | fare    | 19111 | 19111 |
227
+ | 12 | well    | 5458  | 5458  |
228
+ | 13 | \</s\>  | 2     | 2     |
229
+
230
+
231
+
232
+ If we look at the input data, the above table seems correct! (The jsonl
233
+ version is repeated below for reference):
234
+
235
+
236
+ ```bash
237
+ $ head -n1 output.jsonl | python -m json.tool
238
+
239
+
240
+ {
241
+ "segments": [
242
+ {
243
+ "label": true,
244
+ "text": "<s>Hello\n"
245
+ },
246
+ {
247
+ "label": true,
248
+ "text": "hi there!. "
249
+ },
250
+ {
251
+ "label": false,
252
+ "text": "goodbye "
253
+ },
254
+ {
255
+ "label": true,
256
+ "text": "farewell</s>"
257
+ }
258
+ ]
259
+ }
260
+ ```
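
For longer rows, eyeballing a table gets tedious. The sketch below (a hypothetical helper, not part of axolotl) finds the index ranges where the label is `-100`, using the label values from the table above.

```python
IGNORE_INDEX = -100


def masked_spans(labels):
    """Return [start, end) index ranges where labels equal IGNORE_INDEX."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == IGNORE_INDEX and start is None:
            start = i
        elif label != IGNORE_INDEX and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Label values copied from the table above.
labels = [1, 22557, 13, 12014, 736, 28808, 28723, 28705,
          -100, -100, -100, 19111, 5458, 2]
print(masked_spans(labels))  # [(8, 11)]
```

With `tok` and `row` loaded as shown earlier, decoding `row['input_ids'][8:11]` should then give back the masked `goodbye ` segment, assuming the tokenization matches the table.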