## Instructions to run on Google cloud TPUs
Before starting, prepare the dataset (normalization -> bpe -> .. -> binarization) by following the steps in the IndicTrans workflow. To save time and cost, run this preprocessing on a CPU instance before launching the TPU instance.
### Creating TPU instance
- Create a CPU instance on GCP with the `torch-xla` image:
```bash
gcloud compute --project=${PROJECT_ID} instances create <name for your instance> \
--zone=<zone> \
--machine-type=n1-standard-16 \
--image-family=torch-xla \
--image-project=ml-images \
--boot-disk-size=200GB \
--scopes=https://www.googleapis.com/auth/cloud-platform
```
- Once the instance is created, launch a Cloud TPU from your CPU VM instance with the following command (change `--accelerator-type` to suit your needs):
```bash
gcloud compute tpus create <name for your TPU> \
--zone=<zone> \
--network=default \
--version=pytorch-1.7 \
--accelerator-type=v3-8
```
(or)
Create a new tpu using the GUI in https://console.cloud.google.com/compute/tpus and make sure to select `version` as `pytorch 1.7`.
- Once the tpu is launched, identify its ip address:
```bash
# Run this inside the CPU instance and note the IP address shown in the NETWORK_ENDPOINTS column
gcloud compute tpus list --zone=<zone>
```
(or)
Go to https://console.cloud.google.com/compute/tpus and note the IP address of the created TPU from the `internal ip` column.
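If you prefer to script the lookup instead of reading the table by eye, the following is a minimal sketch. The helper name `tpu_ip` is hypothetical, and it assumes that `gcloud compute tpus describe` exposes the address as `networkEndpoints[0].ipAddress`; check the output of a plain `describe` call on your setup before relying on it.

```bash
# Hypothetical helper (not part of IndicTrans): print the internal IP of a TPU node.
# Assumption: the describe output contains networkEndpoints[0].ipAddress.
# $1 = TPU name, $2 = zone
tpu_ip() {
  gcloud compute tpus describe "$1" --zone="$2" \
    --format='value(networkEndpoints[0].ipAddress)'
}

# Example call (TPU name and zone are placeholders):
#   tpu_ip my-tpu us-central1-a
```

The function only wraps a single `gcloud` call, so it prints whatever address `gcloud` reports for the first network endpoint.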
### Installing Fairseq, getting data on the cpu instance
- Activate the `torch xla 1.7` conda environment and install necessary libs for IndicTrans (**Excluding FairSeq**):
```bash
conda activate torch-xla-1.7
pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow
```
- Configure environment variables for TPU:
```bash
export TPU_IP_ADDRESS=ip-address; \
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
```
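A common slip in this step is leaving the literal `ip-address` placeholder in place. The sketch below builds the same `XRT_TPU_CONFIG` string from an address and rejects values that do not look like an IPv4 address. The function name `make_xrt_config` is hypothetical; the `tpu_worker;0;` prefix and port `8470` are copied from the step above.

```bash
# Hypothetical helper: build the XRT_TPU_CONFIG value from a TPU IP address.
# The "tpu_worker;0;" prefix and port 8470 come from the export step above.
make_xrt_config() {
  # Reject anything that does not look like a dotted-quad address
  # (this checks the shape only, not that each part is <= 255).
  if ! printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "invalid TPU IP address: '$1'" >&2
    return 1
  fi
  printf 'tpu_worker;0;%s:8470\n' "$1"
}

# Example use (10.0.0.2 is a made-up address):
#   XRT_TPU_CONFIG="$(make_xrt_config 10.0.0.2)" && export XRT_TPU_CONFIG
```

Assigning first and exporting afterwards keeps a failed validation from being masked by the exit status of `export`.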
- Download the prepared binarized data for FairSeq
- Clone Fairseq (recent versions support TPUs) and install it from source. Because of an [issue](https://github.com/pytorch/fairseq/issues/3259) with the latest commit at the time of writing, we check out an earlier commit. (This may have been fixed in the latest master, but we have not tested it.)
```bash
git clone https://github.com/pytorch/fairseq.git
# enter the repository before checking out the pinned commit
cd fairseq
git checkout da9eaba12d82b9bfc1442f0e2c6fc1b895f4d35d
pip install --editable ./
```
- Start TPU training
```bash
# this is for using all tpu cores
export MKL_SERVICE_FORCE_INTEL=1
fairseq-train {expdir}/exp2_m2o_baseline/final_bin \
--max-source-positions=200 \
--max-target-positions=200 \
--max-update=1000000 \
--save-interval=5 \
--arch=transformer \
--attention-dropout=0.1 \
--criterion=label_smoothed_cross_entropy \
--source-lang=SRC \
--lr-scheduler=inverse_sqrt \
--skip-invalid-size-inputs-valid-test \
--target-lang=TGT \
--label-smoothing=0.1 \
--update-freq=1 \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--warmup-init-lr 1e-07 \
--lr 0.0005 \
--warmup-updates 4000 \
--dropout 0.2 \
--weight-decay 0.0 \
--tpu \
--distributed-world-size 8 \
--max-tokens 8192 \
--num-batch-buckets 8 \
--tensorboard-logdir {expdir}/exp2_m2o_baseline/tensorboard \
--save-dir {expdir}/exp2_m2o_baseline/model \
--keep-last-epochs 5 \
--patience 5
```
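To get a feel for what these flags imply, the sketch below works out two quantities from the command above: the number of tokens per optimizer update, and the learning rate at a given update. Both function names are hypothetical. The first assumes `--max-tokens` is a per-core limit; the second follows our understanding of the `inverse_sqrt` scheduler (linear warmup from `--warmup-init-lr` to `--lr`, then decay proportional to the inverse square root of the update number), so treat the numbers as estimates rather than as fairseq's exact output.

```bash
# Tokens per optimizer update (upper bound), assuming --max-tokens applies per core.
# $1 = --max-tokens, $2 = --distributed-world-size, $3 = --update-freq
tokens_per_update() {
  echo $(( $1 * $2 * $3 ))
}

# Approximate learning rate at a given update for the inverse_sqrt scheduler.
# $1 = update number, $2 = --lr, $3 = --warmup-init-lr, $4 = --warmup-updates
inverse_sqrt_lr() {
  awk -v step="$1" -v lr="$2" -v init="$3" -v warmup="$4" 'BEGIN {
    if (step < warmup)
      cur = init + step * (lr - init) / warmup   # linear warmup
    else
      cur = lr * sqrt(warmup / step)             # inverse square root decay
    printf "%.6f\n", cur
  }'
}

tokens_per_update 8192 8 1              # prints 65536
inverse_sqrt_lr 4000 0.0005 1e-07 4000  # prints 0.000500 (peak, end of warmup)
inverse_sqrt_lr 16000 0.0005 1e-07 4000 # prints 0.000250 (half the peak)
```

With the settings above, each update sees at most 65536 tokens, and the learning rate halves every time the update count quadruples after warmup.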
**Note:** In our experiments, training on TPUs was slower than training on multiple GPUs. We have documented the problems we ran into and [filed an issue](https://github.com/pytorch/fairseq/issues/3317) in the fairseq repo asking for advice. We will update this section as we learn more about efficient training on TPUs. Feel free to open an issue or pull request if you find a bug or know a way to make training faster on TPUs.