I’ve read that Perplexity (PPL) is one of the most common metrics for evaluating autoregressive and causal language models. But what do we use for MLMs like BERT?
I need to evaluate BERT models after pre-training and compare them to existing BERT models, without going through downstream GLUE-style benchmark tasks.
Before releasing new models, I usually evaluate multiple checkpoints on at least two downstream tasks (normally PoS tagging and NER).
But you could also evaluate the MLM capability of some checkpoints, as is shown in the following paper:
I would use the “Cloze test word prediction” task: mask out some subwords of an input sentence, have the model reconstruct the masked subwords, and compute accuracy. With that task you can at least measure the MLM capability of your checkpoints, without the extensive hyperparameter search and multiple runs you need for downstream tasks.
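A minimal sketch of such an evaluation, assuming a Hugging Face `transformers` checkpoint (here `bert-base-cased` as a placeholder for your own model), a couple of illustrative sentences, and a 15% masking rate; all of these are assumptions you would swap for your own setup:

```python
import random

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-cased"  # replace with your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Held-out evaluation sentences (illustrative only).
sentences = [
    "The capital of France is Paris.",
    "BERT is a bidirectional transformer model.",
]

correct, total = 0, 0
for sentence in sentences:
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = enc["input_ids"].clone()

    # Mask roughly 15% of the non-special subword positions.
    special = tokenizer.get_special_tokens_mask(
        labels[0].tolist(), already_has_special_tokens=True
    )
    candidates = [i for i, s in enumerate(special) if s == 0]
    n_mask = max(1, int(0.15 * len(candidates)))
    masked_positions = random.sample(candidates, n_mask)
    for pos in masked_positions:
        input_ids[0, pos] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(
            input_ids=input_ids, attention_mask=enc["attention_mask"]
        ).logits

    # Accuracy: predicted subword matches the original at each masked position.
    for pos in masked_positions:
        pred = logits[0, pos].argmax(dim=-1).item()
        correct += int(pred == labels[0, pos].item())
        total += 1

print(f"Masked-token prediction accuracy: {correct / total:.3f}")
```

Running the same script (with a fixed random seed and the same evaluation sentences) over each checkpoint gives you a comparable MLM accuracy number without any fine-tuning.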