Hi everyone,
I keep running into a problem when training my language model from scratch, following this tutorial: notebooks/language_modeling_from_scratch.ipynb at master · huggingface/notebooks · GitHub
I trained a WordPiece tokenizer (like BERT's), added my special tokens, and successfully saved the tokenizer. I now want to use it to train my language model from scratch.
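For context, the tokenizer training looks roughly like this (a simplified sketch; the corpus path, vocabulary size, and output directory are placeholders):

```python
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# Train a BERT-style WordPiece tokenizer with the standard special tokens
tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=["corpus.txt"],  # placeholder path to my training text
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("my-tokenizer")  # writes vocab.txt into the directory

# Reload it as a fast tokenizer so it can be used with the Trainer
tokenizer = BertTokenizerFast.from_pretrained("my-tokenizer")
```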
Now, when I run the following code to train my model:
from transformers import DataCollatorForLanguageModeling, Trainer

# Dynamically mask 15% of the tokens for the masked-language-modeling objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)

trainer.train()
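For reference, lm_datasets comes from the tutorial's preprocessing step. Simplified, it looks like this (raw_datasets, the "text" column name, and block_size are placeholders from my setup):

```python
block_size = 128  # placeholder sequence length

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all sequences, then split them into block_size chunks,
    # dropping the small remainder (as in the tutorial)
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
```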
This is the error I get, and I cannot figure out why it is happening:
***** Running training *****
Num examples = 5660
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 2124
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <module>()
----> 1 trainer.train()
(11 intermediate frames omitted)
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2041 # remove once script supports set_grad_enabled
2042 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2044
2045
IndexError: index out of range in self
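My first guess is that some token id produced by my tokenizer falls outside the model's embedding matrix, but I have not confirmed this. Here is a quick sanity check I can run (a hypothetical snippet, using the same tokenizer and model objects as above):

```python
# The embedding lookup fails with "index out of range" when a token id
# is >= the number of rows in the input embedding matrix, so these
# numbers should all line up.
print(len(tokenizer))                             # tokenizer vocabulary size
print(model.config.vocab_size)                    # vocab size the model was configured with
print(model.get_input_embeddings().weight.shape)  # actual embedding matrix shape
```

Any pointers would be much appreciated!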