This is an error with the MLM script (run_mlm.py, PyTorch) when attempting to pre-train BigBird on TPUs over XLA. The dataset in question is a custom dataset, and the model config and tokenizer have been initialized appropriately.
This is a continuation of this unanswered forum post, which hits the same error.
Command used to run the script:
%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
--model_type="big_bird" \
--config_name="./config" \
--tokenizer_name="./tokenizer" \
--train_file="./dataset.txt" \
--validation_file="./val.txt" \
--line_by_line="True" \
--max_seq_length="16000" \
--weight_decay="0.01" \
--per_device_train_batch_size="1" \
--per_device_eval_batch_size="1" \
--learning_rate="3e-4" \
--tpu_num_cores='8' \
--warmup_steps="1000" \
--overwrite_output_dir \
--pad_to_max_length \
--num_train_epochs="5" \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--do_train \
--do_eval \
--logging_steps="50" \
--evaluation_strategy="steps" \
--eval_accumulation_steps='10' \
--report_to="tensorboard" \
--logging_dir='./logs' \
--save_strategy="epoch" \
--load_best_model_at_end='True' \
--metric_for_best_model='validation' \
--preprocessing_num_workers='15'
To be precise, I am facing two errors:
Exception in device=TPU:0: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1006, in main_process_first
yield
File "/content/run_mlm.py", line 393, in main
desc="Running tokenizer on dataset line_by_line",
File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
for k, dataset in self.items()
File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
for k, dataset in self.items()
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in map
for rank in range(num_proc)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1664, in <listcomp>
for rank in range(num_proc)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2664, in shard
writer_batch_size=writer_batch_size,
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2254, in select
return self._new_dataset_with_indices(indices_buffer=buf_writer.getvalue(), fingerprint=new_fingerprint)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2170, in _new_dataset_with_indices
fingerprint=fingerprint,
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 297, in __init__
self._indices.column(0)[0].type
File "pyarrow/table.pxi", line 162, in pyarrow.lib.ChunkedArray.__getitem__
File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/run_mlm.py", line 529, in _mp_fn
main()
File "/content/run_mlm.py", line 393, in main
desc="Running tokenizer on dataset line_by_line",
File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/dist-packages/transformers/training_args.py", line 1011, in main_process_first
torch.distributed.barrier()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 2523, in barrier
default_pg = _get_default_group()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
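If I am reading the traceback correctly, the second error is only a side effect of the first: the tokenization runs inside the main_process_first context manager from training_args.py, and when that context manager unwinds after the IndexError it calls torch.distributed.barrier(), which fails because xla_spawn.py never initializes a torch.distributed process group. A rough simplification of that pattern (my own sketch, not the actual transformers code):

import contextlib
import torch.distributed as dist

@contextlib.contextmanager
def main_process_first_sketch():
    # Simplified stand-in for TrainingArguments.main_process_first();
    # the real implementation lives in transformers/training_args.py.
    try:
        yield  # the dataset .map(...) tokenization runs inside this block
    finally:
        # A barrier is issued when the block exits, even if the body raised
        # (here, the IndexError). Since xla_spawn.py never calls
        # torch.distributed.init_process_group, the barrier itself raises
        # the RuntimeError above and masks the original IndexError.
        dist.barrier()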
I haven't modified the script to call init_process_group yet; I am focusing on the earlier IndexError: index out of bounds. The problem clearly arises from my own dataset, which was working before, though. Interestingly, it surfaces during the tokenization stage: at some point while the Arrow dataset is being constructed, it fails. I have no experience with Apache Arrow, so I can't debug any further.
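For reference, my next step would be to reproduce just the tokenization step outside the Trainer, roughly along these lines (an untested sketch that mirrors what run_mlm.py does for line_by_line .txt inputs; the paths, max_length and num_proc match the command above):

from datasets import load_dataset
from transformers import AutoTokenizer

# Load the raw .txt files the same way run_mlm.py does for text inputs
raw_datasets = load_dataset(
    "text",
    data_files={"train": "./dataset.txt", "validation": "./val.txt"},
)

tokenizer = AutoTokenizer.from_pretrained("./tokenizer")

def tokenize_function(examples):
    # line_by_line: drop empty lines, then tokenize with padding/truncation
    lines = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(
        lines,
        padding="max_length",
        truncation=True,
        max_length=16000,
        return_special_tokens_mask=True,
    )

# num_proc=15 mirrors --preprocessing_num_workers. The IndexError came from
# datasets' shard()/select() while splitting the dataset across workers, so
# comparing num_proc=15 against num_proc=1 (and train vs. validation file)
# should at least narrow down where it breaks.
tokenized = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=15,
    remove_columns=["text"],
    desc="Running tokenizer on dataset line_by_line",
)
print(tokenized)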
Can anyone give me some guidance on where I should start investigating the error, and some possible leads as to its origin?