##
Below is an example YAML for BF16 mixed-precision training using Megatron-LM with 2x Data Parallelism, 2x Pipeline Parallelism, and 2x Tensor Parallelism on 8 GPUs. It also uses Sequence Parallelism, selective activation checkpointing, and a sharded optimizer.
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
+distributed_type: MEGATRON_LM
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
+megatron_lm_config:
+  megatron_lm_gradient_clipping: 1.0
+  megatron_lm_num_micro_batches: 2
+  megatron_lm_pp_degree: 2
+  megatron_lm_recompute_activations: true
+  megatron_lm_sequence_parallelism: true
+  megatron_lm_tp_degree: 2
+  megatron_lm_use_distributed_optimizer: true
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
</pre>
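The parallelism degrees in this config multiply out to the GPU count: tensor parallel degree times pipeline parallel degree times data parallel degree equals `num_processes`. A minimal sketch of that arithmetic follows. The helper name `data_parallel_degree` is hypothetical and is not part of Accelerate.

```python
def data_parallel_degree(num_processes: int, tp_degree: int, pp_degree: int) -> int:
    """Return the data-parallel degree left after tensor and pipeline parallelism.

    Hypothetical helper for illustration only; not an Accelerate API.
    """
    model_parallel_size = tp_degree * pp_degree
    if num_processes % model_parallel_size != 0:
        raise ValueError(
            f"num_processes={num_processes} is not divisible by "
            f"tp_degree * pp_degree = {model_parallel_size}"
        )
    return num_processes // model_parallel_size


# Values taken from the YAML above: 8 processes, TP=2, PP=2.
dp_degree = data_parallel_degree(num_processes=8, tp_degree=2, pp_degree=2)
print(dp_degree)  # 2
```

If the process count is not a multiple of the tensor and pipeline degrees, the layout cannot be formed, which is why the sketch raises an error in that case.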
##
<pre>
from accelerate import Accelerator
+from accelerate.utils import MegatronLMDummyScheduler
accelerator = Accelerator()
...
-lr_scheduler = get_scheduler(
-    name=args.lr_scheduler_type,
-    ...
-)
+lr_scheduler = MegatronLMDummyScheduler(
+    optimizer=optimizer,
+    num_warmup_steps=...,
+    num_training_steps=...,
+)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
total_batch_size = (
-    args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.state.megatron_lm_plugin.global_batch_size
)
# in evaluation loop
for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        outputs = model(**batch)
    loss = outputs.loss
-    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
+    losses.append(loss)  # For Megatron-LM, the losses are already averaged across the data parallel group
-losses = torch.cat(losses)
+losses = torch.tensor(losses)
</pre>
## | |
If the YAML was generated through the `accelerate config` command: | |
``` | |
accelerate launch {script_name.py} {--arg1} {--arg2} ... | |
``` | |
If the YAML is saved to a `~/config.yaml` file: | |
``` | |
accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ... | |
``` | |
Or you can pass the configuration parameters directly to `accelerate launch`, with no `config.yaml` file:
``` | |
accelerate launch \ | |
--use_megatron_lm \ | |
--num_processes=8 \ | |
--mixed_precision=bf16 \ | |
--megatron_lm_tp_degree=2 \ | |
--megatron_lm_pp_degree=2 \ | |
--megatron_lm_num_micro_batches=2 \ | |
--megatron_lm_sequence_parallelism=true \ | |
--megatron_lm_recompute_activations=true \ | |
--megatron_lm_use_distributed_optimizer=true \ | |
{script_name.py} {--arg1} {--arg2} ... | |
``` | |
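If you launch from a Python wrapper (for example, a job-submission script), the same command can be assembled as an argument list. This is only a sketch: the helper `build_launch_command` and the placeholders `train.py` and `--arg1` are hypothetical, while the flag names and values are copied from the command above.

```python
import shlex

# Flag names and values copied from the `accelerate launch` command above.
megatron_flags = {
    "num_processes": 8,
    "mixed_precision": "bf16",
    "megatron_lm_tp_degree": 2,
    "megatron_lm_pp_degree": 2,
    "megatron_lm_num_micro_batches": 2,
    "megatron_lm_sequence_parallelism": "true",
    "megatron_lm_recompute_activations": "true",
    "megatron_lm_use_distributed_optimizer": "true",
}


def build_launch_command(script: str, script_args: list[str]) -> list[str]:
    """Assemble the argv list for `accelerate launch` (hypothetical helper)."""
    command = ["accelerate", "launch", "--use_megatron_lm"]
    command += [f"--{name}={value}" for name, value in megatron_flags.items()]
    return command + [script, *script_args]


# `train.py` and `--arg1` stand in for your own script and its arguments.
command = build_launch_command("train.py", ["--arg1"])
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to start the job.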
## | |
For Megatron-LM, the supported models are Transformers GPT2, Megatron-BERT, and T5, covering the decoder-only, encoder-only, and encoder-decoder model classes. Given the complexity of Megatron-LM's features, four changes are required to get started:
1. Use `accelerate.utils.MegatronLMDummyScheduler`. Because Megatron-LM uses its own optimizer implementation, the scheduler compatible with it must be used.
2. Compute the total batch size with the tensor and pipeline parallel sizes in mind; read it from `accelerator.state.megatron_lm_plugin.global_batch_size`.
3. Do not gather the losses again: they are already averaged across the data parallel group.
4. Save the model using `accelerator.save_state` instead of Transformers' `save_pretrained`.
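To make the second change concrete: with tensor and pipeline parallelism, only the data-parallel replicas consume distinct batches, so multiplying the per-device batch size by `accelerator.num_processes` overcounts. The sketch below assumes the global batch size is the micro batch size times the number of micro batches times the data-parallel degree. That formula, the helper name, and the micro batch size of 4 are assumptions for illustration; in a real script, read the value from `accelerator.state.megatron_lm_plugin.global_batch_size`.

```python
def megatron_global_batch_size(
    micro_batch_size: int, num_micro_batches: int, dp_degree: int
) -> int:
    """Samples consumed per optimizer step (assumed formula, hypothetical helper)."""
    return micro_batch_size * num_micro_batches * dp_degree


micro_batch_size = 4   # hypothetical per-replica micro batch size
num_micro_batches = 2  # megatron_lm_num_micro_batches from the example config
dp_degree = 2          # 8 GPUs / (TP=2 * PP=2)

global_batch_size = megatron_global_batch_size(
    micro_batch_size, num_micro_batches, dp_degree
)

# The formula removed in the diff counts every process as a separate replica.
num_processes = 8
gradient_accumulation_steps = 1  # hypothetical value
naive_batch_size = micro_batch_size * num_processes * gradient_accumulation_steps

print(global_batch_size)  # 16
print(naive_batch_size)   # 32
```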
The Accelerate Megatron-LM integration supports many advanced features such as: | |
- Leveraging custom training steps | |
- Using Megatron-LM indexed datasets | |
- Checkpoint reshaping and interoperability utilities
- Using `megatron_generate` for text generation with Tensor and Pipeline Parallelism
- Support for RoPE/ALiBi positional embeddings and Multi-Query Attention
However, each of these requires more changes to your source code than what is presented here.
## | |
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/usage_guides/megatron_lm" target="_blank">How to use Megatron-LM</a> | |
- <a href="https://github.com/pacman100/accelerate-megatron-test" target="_blank">Examples showcasing the Megatron-LM integration of Accelerate</a> |