I’ve read that Perplexity (PPL) is one of the most common metrics for evaluating autoregressive and causal language models. But what do we use for MLMs like BERT?
I need to evaluate BERT models after pre-training and compare them to existing BERT models, without going through downstream GLUE-style benchmark tasks.
Before releasing new models, I usually evaluate multiple checkpoints on at least two downstream tasks (normally PoS tagging and NER).
But you could also evaluate the MLM capability of some checkpoints, as is shown in the following paper:
I would use the “Cloze test word prediction” task: mask out some subwords of an input sentence, have the model reconstruct the masked subwords, and compute accuracy. With that task you can at least measure the MLM capability of your checkpoints, without the extensive hyperparameter search and multiple runs you need for downstream tasks.
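A minimal sketch of such an evaluation, assuming a Hugging Face `transformers` checkpoint (here `bert-base-cased` as a placeholder for your own model), a couple of illustrative sentences, and a 15% masking rate; all of these are assumptions you would swap for your own setup:

```python
import random

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-cased"  # replace with your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Held-out evaluation sentences (illustrative only).
sentences = [
    "The capital of France is Paris.",
    "BERT is a bidirectional transformer model.",
]

correct, total = 0, 0
for sentence in sentences:
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = enc["input_ids"].clone()

    # Mask roughly 15% of the non-special subword positions.
    special = tokenizer.get_special_tokens_mask(
        labels[0].tolist(), already_has_special_tokens=True
    )
    candidates = [i for i, s in enumerate(special) if s == 0]
    n_mask = max(1, int(0.15 * len(candidates)))
    masked_positions = random.sample(candidates, n_mask)
    for pos in masked_positions:
        input_ids[0, pos] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(
            input_ids=input_ids, attention_mask=enc["attention_mask"]
        ).logits

    # Accuracy: predicted subword matches the original at each masked position.
    for pos in masked_positions:
        pred = logits[0, pos].argmax(dim=-1).item()
        correct += int(pred == labels[0, pos].item())
        total += 1

print(f"Masked-token prediction accuracy: {correct / total:.3f}")
```

Running the same script (with a fixed random seed and the same evaluation sentences) over each checkpoint gives you a comparable MLM accuracy number without any fine-tuning.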