Hi,
I would like to reproduce the original BERT pre-training on my own data. I was trying to use the example script from the Hugging Face Transformers library (transformers/run_mlm_no_trainer.py at master · huggingface/transformers · GitHub).
The problem is that the data are fed to the model differently: in the original pre-training the inputs were not real sentences but chunks of text of length max_seq_length. This is what the script does if you set the parameter "line_by_line" to false, and this is where my problem starts.
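For context, this is roughly the group_texts function from the example script that produces those chunks (lightly paraphrased; max_seq_length is the chunk size defined earlier in the script):

```python
from itertools import chain

def group_texts(examples):
    # Concatenate all tokenized texts in the batch into one long sequence per column.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a full chunk.
    if total_length >= max_seq_length:
        total_length = (total_length // max_seq_length) * max_seq_length
    # Split into chunks of max_seq_length.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```

As you can see, it operates on already-tokenized columns, which is exactly what I cannot precompute.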
Since I have a huge dataset and train on a GPU cluster with a limit on execution time (24h, after which I have to restart from a training checkpoint), I have to tokenize the sentences on the fly (as explained in 1.3GB dataset creates over 107GB of cache file! · Issue #10204 · huggingface/transformers · GitHub), and therefore I can't apply the group_texts function in advance, because it requires tokenized input.
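To make the issue concrete, here is a rough sketch of the kind of on-the-fly grouping I have in mind, done inside a PyTorch collate function instead of a pre-processing .map() call (the "text" column name, model name and chunking details are just placeholders, and I don't know whether this is a reasonable approach, which is part of my question):

```python
from itertools import chain

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

max_seq_length = 512
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def collate_raw_texts(examples):
    # 'examples' is a list of dicts with a raw "text" field (placeholder column name).
    tokenized = tokenizer([ex["text"] for ex in examples], add_special_tokens=False)
    # Concatenate all token ids of the batch and split into max_seq_length chunks,
    # dropping the small remainder (same idea as group_texts, but per batch).
    # A real setup would carry the remainder over to the next batch and guard
    # against batches that produce no full chunk at all.
    concatenated = list(chain(*tokenized["input_ids"]))
    total_length = (len(concatenated) // max_seq_length) * max_seq_length
    chunks = [
        {"input_ids": concatenated[i : i + max_seq_length]}
        for i in range(0, total_length, max_seq_length)
    ]
    # Let the MLM collator handle padding and random masking.
    return mlm_collator(chunks)
```

This would be passed as collate_fn to a torch.utils.data.DataLoader over the raw, untokenized dataset. The obvious drawbacks are that the grouping only happens within each batch (so chunk boundaries and the number of chunks per step vary) and that special-token handling is simplified compared to the script.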
How can I solve this problem? Are you aware of a pre-training script that follows the original pre-training procedure and can be used with a huge dataset under a limited execution time?
Thanks in advance!
EDIT: tokenizing the whole dataset in advance and saving it is not an option, because I need to keep all the tokenization information for other experiments and, as reported in the issue linked above, tokenization creates GBs (in my case it would be TBs) of files!