Hi
For domain-specific data, say medical drug data with complicated chemical compound names, would it be beneficial to train a tokenizer on the text if the corpus has nearly 18M entries? In the BioBERT paper, they kept the original pre-trained BERT WordPiece vocabulary for the following reasons:
- compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT
- any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.
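To make the question concrete, here is a minimal sketch of what retraining the tokenizer on the domain corpus could look like, using `train_new_from_iterator` from `transformers`. The file name `drug_corpus.txt` and the vocabulary size are placeholder assumptions, not anything from the BioBERT setup:

```python
from transformers import AutoTokenizer

# Start from the original BERT fast tokenizer so the pipeline
# (WordPiece model, special tokens, normalization) stays BERT-like.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# drug_corpus.txt is a placeholder: one drug/compound entry per line.
def corpus_iterator(path="drug_corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Learn a new WordPiece vocabulary from the ~18M domain entries;
# 30522 just mirrors the original BERT vocab size.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=30522
)
new_tokenizer.save_pretrained("bert-drug-tokenizer")

# Compare how a compound name gets segmented by each vocabulary.
print(old_tokenizer.tokenize("acetylsalicylic acid"))
print(new_tokenizer.tokenize("acetylsalicylic acid"))
```

The trade-off is exactly the one the BioBERT authors mention: a new vocabulary would segment compound names into fewer, more meaningful pieces, but the embedding matrix of a pre-trained BERT checkpoint would no longer line up with it, so the model could not simply be re-used.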