Hi
For domain-specific data, say medical drug data with complicated chemical compound names, would it be beneficial to train a tokenizer on the text if the corpus has nearly 18M entries? In the BioBERT paper, they kept the original pre-trained BERT WordPiece vocabulary for the following reasons:
- compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT
- any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.
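To make the question concrete, here is a minimal sketch of what retraining the tokenizer on the domain corpus could look like, using `train_new_from_iterator` from `transformers`. The file name `drug_corpus.txt` and the vocabulary size are placeholder assumptions, not anything from the BioBERT setup:

```python
from transformers import AutoTokenizer

# Start from the original BERT fast tokenizer so the pipeline
# (WordPiece model, special tokens, normalization) stays BERT-like.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# drug_corpus.txt is a placeholder: one drug/compound entry per line.
def corpus_iterator(path="drug_corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Learn a new WordPiece vocabulary from the ~18M domain entries;
# 30522 just mirrors the original BERT vocab size.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=30522
)
new_tokenizer.save_pretrained("bert-drug-tokenizer")

# Compare how a compound name gets segmented by each vocabulary.
print(old_tokenizer.tokenize("acetylsalicylic acid"))
print(new_tokenizer.tokenize("acetylsalicylic acid"))
```

The trade-off is exactly the one the BioBERT authors mention: a new vocabulary would segment compound names into fewer, more meaningful pieces, but the embedding matrix of a pre-trained BERT checkpoint would no longer line up with it, so the model could not simply be re-used.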