Training BERT from scratch with Wikipedia + Book Corpus Dataset

Hello, everyone! I work in a different field of ML and am not very familiar with NLP, so I am seeking your help!

I want to pre-train the standard BERT model on the Wikipedia and Book Corpus datasets (which I think is the standard practice!) as part of my research work.

1. I am following the Hugging Face guide to pretraining a model from scratch: https://huggingface.co/blog/how-to-train

Since they train a different model on a dataset in a different language, the article says:

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT)

So, in my case, should I go for a WordPiece tokenizer for BERT pretraining? (I have a rough idea of what a tokenizer does, but I don't know enough to understand the ramifications of this choice.)

Apart from this, the only other deviation from the article that I see is the choice of dataset. I understand Hugging Face has both the Wikipedia and the Book Corpus datasets.

2. So, how should I go about training? Should I train the model on Wikipedia first and then on Book Corpus? Or should I somehow concatenate them into a larger singular dataset? Is there anything else I should keep in mind?

I would really appreciate it if someone could point me to materials/code for pretraining BERT.
Any other tips/suggestions would be highly appreciated! Thanks a lot!

Or should I somehow concatenate them into a larger singular dataset?

You would benefit from a bigger dataset, so concatenating the two is the better choice.

should I go for a WordPiece tokenizer for BERT pretraining?

BPE and WordPiece have a lot in common:
https://huggingface.co/transformers/tokenizer_summary.html
BERT is trained with WordPiece, so it is natural to choose WordPiece in this case.
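If you want to train your own WordPiece vocabulary instead of reusing an existing one, a rough sketch with the Hugging Face `tokenizers` library could look like this. The toy corpus and the tiny `vocab_size` are placeholders I picked so it runs in seconds. For real pretraining you would iterate over the text of your full corpus and use a vocabulary of roughly 30,000 entries, as the original BERT does.

```python
from tokenizers import BertWordPieceTokenizer

# Toy corpus so the sketch runs offline; in practice, iterate over the
# text of the combined Wikipedia + Book Corpus data instead.
toy_corpus = [
    "BERT is pretrained with a WordPiece tokenizer.",
    "Pretraining BERT from scratch needs a large text corpus.",
    "WordPiece and byte-level BPE both split words into subword pieces.",
]

# lowercase=True mirrors the "uncased" BERT setup (an assumption about
# which variant you want to reproduce).
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train_from_iterator(
    toy_corpus,
    vocab_size=200,   # toy value; use about 30,000 for a real run
    min_frequency=1,  # toy value; keeps rare pieces in this tiny corpus
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

encoding = tokenizer.encode("Pretraining BERT with WordPiece")
print(encoding.tokens)
```

After training, `tokenizer.save_model(some_directory)` writes a `vocab.txt` file, which is the format the BERT tokenizer classes in `transformers` expect.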