I am trying to fine-tune BERT for Masked Language Modeling, and I would like to use a dataset that already contains masked tokens (I want to mask particular words rather than randomly chosen ones).
How can I do this?
I am following the instructions in this notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb#scrollTo=KDBi0reX3l_g
but I am not sure which parts of the code I need to change to make it compatible with a dataset that already has [MASK] tokens in it.
Thanks!
The masking is done by the data collator `DataCollatorForLanguageModeling`. Just pass `mlm=False` to that data collator to deactivate the random masking there.
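Something like this should work (a minimal sketch, assuming a `bert-base-uncased` tokenizer and the `Trainer` setup from the notebook):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# With mlm=False the collator does not apply any random masking, so the
# [MASK] tokens already present in your examples are passed through as-is.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Then hand the collator to the Trainer as in the notebook, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset,
#                   data_collator=data_collator)
```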
Thank you so much!