Hello, everyone! I work in a different area of ML and am not very familiar with NLP, so I am seeking your help!
I want to pre-train the standard BERT model on the Wikipedia and BookCorpus datasets (which I believe is the standard practice!) as part of my research work.
- I am following the Hugging Face guide to pretraining a model from scratch: https://huggingface.co/blog/how-to-train
Now, since they are training a different model on a dataset in a different language, the article mentions:
> We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT)
1. So, in my case, should I stick with a WordPiece tokenizer for BERT pretraining? (I have a rough idea of what a tokenizer does, but I don't know enough to understand the ramifications of this choice.)
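For reference, this is roughly what I was planning to do, based on the `tokenizers` library docs; the file name `corpus.txt` is just a placeholder for plain-text dumps of the corpus, so please correct me if this is the wrong approach for BERT:

```python
# Rough sketch (my assumption, not verified): train a BERT-style WordPiece
# tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)  # uncased, like bert-base-uncased
tokenizer.train(
    files=["corpus.txt"],        # placeholder: plain-text files from the corpus
    vocab_size=30_522,           # BERT-base vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")        # writes vocab.txt for later use
```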
Apart from this, the only other deviation from the article that I see is the choice of dataset; I understand Hugging Face hosts both the Wikipedia and BookCorpus datasets.
2. So, how should I go about training? Should I train the model on Wikipedia first and then on BookCorpus, or should I somehow concatenate them into a single larger dataset? Is there anything else I should keep in mind? (A rough sketch of the concatenation approach I had in mind is below.)
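This is what I was thinking for combining the two corpora; I'm not sure I have the dataset names/configs on the Hub exactly right (e.g. the `"20220301.en"` Wikipedia config is my assumption):

```python
# Rough sketch (assumed dataset names/configs): load both corpora from the Hub
# and concatenate them into one dataset before tokenization.
from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the shared "text" column so the schemas match before concatenating.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

combined = concatenate_datasets([wiki, books]).shuffle(seed=42)
```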
I would really appreciate it if someone could point me to materials/code for pretraining BERT.
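For context, here is roughly the training setup I've pieced together so far from the blog post, adapted to BERT with masked language modeling via the `Trainer` API. The tokenizer path, hyperparameters, and `tokenized_dataset` variable are placeholders/assumptions on my part, so I'd love to know if this is wired up correctly (e.g. whether skipping next-sentence prediction is okay):

```python
# Rough sketch (my assumptions, not a verified recipe): masked-LM pretraining
# of a from-scratch BERT-base with the transformers Trainer.
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("./my_tokenizer")  # placeholder path
model = BertForMaskedLM(BertConfig())  # BERT-base defaults, random init

# Standard 15% masking for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./bert-pretrain",
    per_device_train_batch_size=32,   # guess; depends on GPU memory
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset,  # placeholder: the tokenized combined corpus
)
trainer.train()
```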
Any other tips/suggestions would be highly appreciated! Thanks a lot!