Hello, everyone! I work in a different area of ML and am not very familiar with NLP, so I am seeking your help!
I want to pre-train the standard BERT model on the Wikipedia and BookCorpus datasets (which I believe is the standard practice!) as part of my research work.
- I am following the Hugging Face guide to pretraining a model from scratch: https://huggingface.co/blog/how-to-train
Now, since they are training a different model on a dataset in a different language, the article mentions:
> We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT)
1. So, in my case, should I stick with a WordPiece tokenizer for BERT pretraining? (I have a rough idea of what a tokenizer does, but I don't know enough to understand the ramifications of this choice.)
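For reference, this is roughly what I was planning to do, based on the `tokenizers` library docs; the file name `corpus.txt` is just a placeholder for plain-text dumps of the corpus, so please correct me if this is the wrong approach for BERT:

```python
# Rough sketch (my assumption, not verified): train a BERT-style WordPiece
# tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)  # uncased, like bert-base-uncased
tokenizer.train(
    files=["corpus.txt"],        # placeholder: plain-text files from the corpus
    vocab_size=30_522,           # BERT-base vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")        # writes vocab.txt for later use
```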
Apart from this, the only other deviation from the article that I see is the choice of dataset; I understand Hugging Face hosts both the Wikipedia and BookCorpus datasets.
2. So, how should I go about training? Should I train the model on Wikipedia first and then on BookCorpus, or should I somehow concatenate them into a single larger dataset? Is there anything else I should keep in mind? (A rough sketch of the concatenation approach I had in mind is below.)
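This is what I was thinking for combining the two corpora; I'm not sure I have the dataset names/configs on the Hub exactly right (e.g. the `"20220301.en"` Wikipedia config is my assumption):

```python
# Rough sketch (assumed dataset names/configs): load both corpora from the Hub
# and concatenate them into one dataset before tokenization.
from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the shared "text" column so the schemas match before concatenating.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

combined = concatenate_datasets([wiki, books]).shuffle(seed=42)
```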
I would really appreciate it if someone could point me to materials/code for pretraining BERT.
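For context, here is roughly the training setup I've pieced together so far from the blog post, adapted to BERT with masked language modeling via the `Trainer` API. The tokenizer path, hyperparameters, and `tokenized_dataset` variable are placeholders/assumptions on my part, so I'd love to know if this is wired up correctly (e.g. whether skipping next-sentence prediction is okay):

```python
# Rough sketch (my assumptions, not a verified recipe): masked-LM pretraining
# of a from-scratch BERT-base with the transformers Trainer.
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("./my_tokenizer")  # placeholder path
model = BertForMaskedLM(BertConfig())  # BERT-base defaults, random init

# Standard 15% masking for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./bert-pretrain",
    per_device_train_batch_size=32,   # guess; depends on GPU memory
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset,  # placeholder: the tokenized combined corpus
)
trainer.train()
```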
Any other tips/suggestions would be highly appreciated! Thanks a lot!