RoBERTa MLM fine-tuning

stefan-jo · October 1, 2020, 10:16am

Hello,

I want to fine-tune RoBERTa for MLM on a dataset of about 200k texts. The texts are reviews from online forums ranging from basic conversations to technical descriptions with a very specific vocabulary.

I have two questions regarding data preparation:

1. Can I simply use RobertaTokenizer.from_pretrained("roberta-base") even if the vocabulary of my fine-tuning corpus might differ significantly from the pre-training corpus? Or is there a way to “adjust” the tokenizer to the new data?
2. Each review comes with the title of the thread it was posted in. From earlier experiments I know that concatenating titles and texts (with a special separator token between them) improves model performance for classification. How should this be handled during language model fine-tuning? Since some threads contain hundreds of reviews, it seems wasteful for the language model to predict on the same title over and over again.

Pataleros · November 24, 2021, 4:18pm

Hello there,

I am currently trying to do the same: fine-tune RoBERTa on a very specific vocabulary of my own (let’s say biology terms).

Regarding your first question, you should at least add some new words specific to your domain to the tokenizer vocabulary. See this discussion: "how can i finetune BertTokenizer?" (Issue #2691, huggingface/transformers, GitHub).
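As a rough illustration of what "adding new words" could look like, here is a minimal sketch. It assumes the `transformers` library with `roberta-base`; the helper names `frequent_domain_words` and `extend_tokenizer_and_model`, the frequency thresholds, and the "split into more than one piece" filter are hypothetical choices for illustration, not something taken from the linked issue.

```python
from collections import Counter


def frequent_domain_words(texts, min_count=2, max_new_tokens=1000):
    """Return frequent lowercase alphabetic words, most frequent first."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word.isalpha()
    )
    return [w for w, c in counts.most_common(max_new_tokens) if c >= min_count]


def extend_tokenizer_and_model(candidates, model_name="roberta-base"):
    """Add candidate words to the tokenizer and resize the model embeddings."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import RobertaForMaskedLM, RobertaTokenizerFast

    tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
    model = RobertaForMaskedLM.from_pretrained(model_name)

    # Assumption: only words that the pretrained BPE splits into several
    # pieces are worth adding as whole tokens.
    new_tokens = [w for w in candidates if len(tokenizer.tokenize(" " + w)) > 1]
    num_added = tokenizer.add_tokens(new_tokens)

    # The new embedding rows are randomly initialised, so they only become
    # useful after further training (for example MLM fine-tuning).
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added
```

Usage would be something like `tokenizer, model, n = extend_tokenizer_and_model(frequent_domain_words(my_texts))`, where `my_texts` is your own list of strings. Whether added tokens actually help depends on how much in-domain text is available to train their embeddings.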

Regarding the MLM training, which class did you use exactly? I am looking for more information online and found this notebook (NLP-with-Deep-Learning/fine_tuning_bert_with_MLM.ipynb at master · yash-007/NLP-with-Deep-Learning · GitHub), but I wonder how it would work for RoBERTa.
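For what it is worth, one plausible RoBERTa counterpart is `RobertaForMaskedLM` combined with `DataCollatorForLanguageModeling` and `Trainer` from `transformers`. The sketch below assumes `transformers`, `datasets` and `torch` are installed; the function name `build_mlm_trainer`, the `max_length` of 256, and all hyperparameters are illustrative assumptions rather than recommended settings.

```python
def build_mlm_trainer(texts, output_dir="roberta-mlm-finetuned",
                      model_name="roberta-base", mlm_probability=0.15):
    """Build a Trainer that continues masked-language-model training."""
    # Lazy imports: nothing heavy is loaded until the function is called.
    from datasets import Dataset
    from transformers import (
        DataCollatorForLanguageModeling,
        RobertaForMaskedLM,
        RobertaTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
    model = RobertaForMaskedLM.from_pretrained(model_name)

    def tokenize(batch):
        # Assumption: each review is truncated to 256 tokens.
        return tokenizer(batch["text"], truncation=True, max_length=256)

    dataset = Dataset.from_dict({"text": list(texts)})
    dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # The collator pads each batch and masks tokens on the fly, so every
    # epoch sees a different random masking of the same text.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,              # illustrative value
        per_device_train_batch_size=16,  # illustrative value
        learning_rate=5e-5,              # illustrative value
    )
    return Trainer(model=model, args=args, train_dataset=dataset,
                   data_collator=collator)
```

Calling `trainer = build_mlm_trainer(my_texts)` followed by `trainer.train()` and `trainer.save_model("roberta-mlm-finetuned")` would then run the fine-tuning, with `my_texts` being your own list of strings. If tokens were added to the tokenizer beforehand, the same extended tokenizer and resized model would need to be used here instead of the freshly loaded ones.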

Thanks
