Hi,
I am pre-training a BERT model from scratch. For that, I first need to train a WordPiece tokenizer, and I am using BertWordPieceTokenizer for this.
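For reference, here is a minimal sketch of what I am doing (the file names and `vocab_size` are placeholders, not my actual setup):

```python
from tokenizers import BertWordPieceTokenizer

# Fresh WordPiece tokenizer with BERT-style defaults
tokenizer = BertWordPieceTokenizer(lowercase=True)

# Train on the raw text files of the corpus
tokenizer.train(
    files=["corpus_part_1.txt", "corpus_part_2.txt"],  # placeholder paths
    vocab_size=30_522,   # same size as the original BERT vocab
    min_frequency=2,
)

# Writes vocab.txt to the current directory
tokenizer.save_model(".")
```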
My questions:
Should I train the tokenizer on the whole corpus, which is huge, or is training it on a sample enough?
Is there a way to tell the tokenizer to train only on a sample?
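The only workaround I can think of is sampling the corpus down to a smaller file myself and training on that, roughly like this (`sample_rate` is just a knob I made up, not a library option):

```python
import random

random.seed(0)       # reproducible sample
sample_rate = 0.1    # keep roughly 10% of the lines

# Write a random subset of the corpus lines to a separate file
with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_sample.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if random.random() < sample_rate:
            dst.write(line)

# Then train the tokenizer on the sampled file only
tokenizer.train(files=["corpus_sample.txt"], vocab_size=30_522)
```

Is that the intended way, or does the library have built-in support for this (e.g. `train_from_iterator` in newer versions of tokenizers)?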
Thanks.