Maxime
committed on
Create multi-node.md (#613)
* Create multi-node.md
* Update multi-node.md
* Update multi-node.md
- docs/multi-node.md +45 -0
docs/multi-node.md
ADDED
@@ -0,0 +1,45 @@
# Multi Node

You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using the preset below:

~/.cache/huggingface/accelerate/default_config.yaml
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
machine_rank: 0 # Set to 0 for the main machine, increment by one for each other machine
main_process_ip: 10.0.0.4 # Set to the main machine's IP
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2 # Change to the number of machines
num_processes: 4 # Total number of GPUs across all machines (for example, 2 machines with 4 GPUs each means 8)
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
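Of the annotated values in the preset, only `machine_rank` differs from one machine to the next; the IP, port, and counts are shared. As a minimal sketch of that rule, assuming every machine has the same number of GPUs, the helper below computes the values to write into each machine's file. The function name `accelerate_settings` and its parameters are hypothetical and are not part of accelerate or Axolotl.

```python
def accelerate_settings(num_machines, gpus_per_machine, main_process_ip,
                        main_process_port=5000):
    """Return one dict of accelerate settings per machine (hypothetical helper)."""
    if num_machines < 1 or gpus_per_machine < 1:
        raise ValueError("num_machines and gpus_per_machine must be at least 1")
    # num_processes is the total GPU count across all machines, not per machine.
    # Assumption: every machine has the same number of GPUs.
    num_processes = num_machines * gpus_per_machine
    return [
        {
            "machine_rank": rank,  # 0 is the main machine
            "main_process_ip": main_process_ip,  # always the main machine's IP
            "main_process_port": main_process_port,
            "num_machines": num_machines,
            "num_processes": num_processes,
        }
        for rank in range(num_machines)
    ]


if __name__ == "__main__":
    # The example from the preset's comment: 2 machines with 4 GPUs each.
    for settings in accelerate_settings(2, 4, "10.0.0.4"):
        print(settings)
```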

Configure your model to use FSDP in its Axolotl configuration file, for example:
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

## Machine configuration

On each machine you need a copy of Axolotl; we suggest using the same commit to ensure compatibility.

You will also need the same model configuration file on each machine.

On the main machine only, make sure the port you set as `main_process_port` is open for TCP and reachable by the other machines.
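
One way to check reachability from another machine is to attempt a plain TCP connection to the main machine. The sketch below uses only Python's standard `socket` module; the function name `port_is_reachable` is hypothetical, and the host and port in the example comment are the ones from the preset. Note that a failed connection can also mean that nothing is listening on the port yet, so this check is only conclusive while a process on the main machine is bound to it.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, timed out) when it cannot complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, run from a non-main machine with the preset's values:
#   port_is_reachable("10.0.0.4", 5000)
```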

All you have to do now is launch with accelerate on each machine, as you usually would, and voilà: the processes will start once you have launched accelerate on every machine.