Hi. I’m currently using BERT-like models (e.g., bert-base-cased, bert-base-multilingual-cased) for a project at work. The readily available tokenizers produce a lot of [UNK] tokens on my data, so I wanted to create my own tokenizer. However, I’m wondering if I can just make a new tokenizer with a new vocabulary and use that with one of the standard pretrained models.
My thinking is that it won’t work properly, since the available models have been pretrained using their own tokenizers. However, I’m curious whether this approach would be viable. Thanks.
Yeah, it won’t work well. With a new tokenizer, you’d have to retrain the embedding layer at the very least (it could be a different size as well), since the pretrained embeddings are tied to the token IDs of the original vocabulary. But I’d say this is a pretty radical solution. These tokenizers have individual characters as tokens as well, so there really shouldn’t be that many [UNK] unless your data contains characters that weren’t present in the tokenizer’s training data. I think it should be possible to extend the tokenizer’s vocabulary with additional tokens, if you know what they are. You would still have to train the embeddings for them though, so it probably won’t work well right off the bat.
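To make that concrete, here is a rough toy sketch in plain Python (not the actual transformers code) of greedy WordPiece matching. It shows how a single uncovered character turns a whole word into [UNK], and how adding a token means appending a new, untrained embedding row. The vocabulary, the word "hellø", the embedding size, and the helper names `wordpiece` and `add_tokens` are all made up for illustration.

```python
import random

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # One character with no matching piece makes the whole word [UNK].
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

def add_tokens(new_tokens, vocab, embeddings, dim):
    """Append unseen tokens to the vocab and add one random embedding row each."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            # The new row is random noise until it is trained.
            embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])
            added += 1
    return added

# Hypothetical toy vocabulary and "pretrained" embedding table.
random.seed(0)
dim = 4
vocab = {"[UNK]": 0, "h": 1, "##e": 2, "##l": 3, "##o": 4, "hell": 5}
embeddings = [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in vocab]

before = wordpiece("hellø", vocab)       # ['[UNK]'] because 'ø' is not covered
num_added = add_tokens(["##ø"], vocab, embeddings, dim)
after = wordpiece("hellø", vocab)        # ['hell', '##ø']

print(before, after, len(vocab), len(embeddings))
```

In the transformers library, the analogous steps are tokenizer.add_tokens([...]) followed by model.resize_token_embeddings(len(tokenizer)). As far as I know, tokens added that way are matched as standalone strings rather than as ## pieces, so the behavior differs a bit from this toy version. Either way, the new rows start out untrained, which is why some fine-tuning is needed before the model can make use of them.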