Performance vs the original architecture on approximate original data sizes (BooksCorpus/Wikipedia)

#54
by tollefj - opened

There's a tremendous difference between the pre-training data sizes of ModernBERT and the original BERT models (1.7T tokens vs. 3.3B words). How much of the performance gain comes from the more comprehensive data sources rather than the architectural changes? Or have I missed some details about this in the paper?