Performance vs the original architecture on approximate original data sizes (BooksCorpus/Wikipedia)

#54
by tollefj - opened

There's a tremendous difference between the pre-training data sizes of ModernBERT and the original BERT models (1.7T tokens vs. 3.3B words). How much of the performance gain comes from the more comprehensive data sources rather than the architectural changes? Or have I missed some details about this in the paper?