I finetuned a pre-trained BERT model on my custom dataset for the LM task, to introduce new vocabularies (around 40k new tokens) from my dataset. Now that I am trying to further finetune the trained model on another classification task, I have been unable to load the pre-trained tokenizer with added vocabulary properly.
I tried loading it with BertTokenizer, but encoding each sentence with encode_plus takes 1m 23s. That’s too much considering I have over 200k sentences for classification in my training data alone. I know that I can also use batch_encode_plus with parallelization, but even then it would take forever to encode just my training data.
I also tried loading it up using BertTokenizerFast and AutoTokenizer, but they take forever to load up.
I tried running the same script with the pre-trained BERT tokenizers without my added tokens, and it takes a fraction of a second (994 µs) to encode the entire batch. So the problem is definitely with my own pre-trained tokenizer, which has the newly added tokens.
Has anyone encountered a similar problem before? While pretraining, I saved the tokenizer with AutoTokenizer’s save_pretrained function. When I check the tokenizer after loading it with BertTokenizer, I can see all the newly added tokens using the get_vocab() function. So it’s unlikely that something went wrong while saving it.
Hi, the base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods for encoding string inputs into model inputs (see below) and for instantiating/saving Python and “Fast” tokenizers, either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository). They both rely on PreTrainedTokenizerBase, which contains the common methods, and on SpecialTokensMixin.
PreTrainedTokenizer and PreTrainedTokenizerFast thus implement the main methods for using all the tokenizers:
Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).
Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.
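As a rough illustration of the three points above, here is a minimal sketch that runs offline. It assumes a transformers 4.x install; the toy vocabulary and the added token "remdesivir" are made up for the example and have nothing to do with your 40k tokens.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary; a real BERT vocab has roughly 30k entries.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", "drug", "works"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(VOCAB) + "\n")
    # The vocabulary is read into memory at construction time.
    tokenizer = BertTokenizer(vocab_file=vocab_path)

# 1. Tokenizing, converting tokens to ids and back, and encoding.
tokens = tokenizer.tokenize("the drug works")
ids = tokenizer.convert_tokens_to_ids(tokens)
round_trip = tokenizer.convert_ids_to_tokens(ids)
encoded = tokenizer.encode("the drug works")  # adds [CLS] and [SEP]

# 2. Adding a new token without touching the WordPiece vocabulary file.
before = tokenizer.tokenize("remdesivir works")  # unknown word maps to [UNK]
num_added = tokenizer.add_tokens(["remdesivir"])
after = tokenizer.tokenize("remdesivir works")   # now kept as a single token

# 3. Special tokens are exposed as attributes on the tokenizer.
mask_token, mask_id = tokenizer.mask_token, tokenizer.mask_token_id

print(tokens, ids, encoded)
print(before, "->", after, f"({num_added} token added)")
print(mask_token, mask_id)
```

After add_tokens, len(tokenizer) grows by the number of tokens actually added, which is the size to pass to model.resize_token_embeddings before further training.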
BatchEncoding holds the output of the PreTrainedTokenizerBase’s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…). When the tokenizer is a “Fast” tokenizer (i.e., backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
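To make the difference concrete, here is a small sketch comparing the BatchEncoding returned by a slow and a fast tokenizer. It assumes transformers 4.x with the tokenizers package installed, and again uses a made-up toy vocabulary so that nothing is downloaded.

```python
import os
import tempfile

from transformers import BatchEncoding, BertTokenizer, BertTokenizerFast

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", "drug", "works"]
TEXT = "the drug works"

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(TOY_VOCAB) + "\n")
    slow_tok = BertTokenizer(vocab_file=vocab_path)
    # Without a tokenizer.json, the fast class is built by converting the slow one.
    fast_tok = BertTokenizerFast(vocab_file=vocab_path)

slow_enc = slow_tok(TEXT)
fast_enc = fast_tok(TEXT)

# Both results are BatchEncoding objects and support dictionary-style access.
assert isinstance(slow_enc, BatchEncoding) and isinstance(fast_enc, BatchEncoding)
same_ids = slow_enc["input_ids"] == fast_enc["input_ids"]

# Only the fast encoding carries alignment information.
fast_tokens = fast_enc.tokens()          # token strings, including [CLS] and [SEP]
word_ids = fast_enc.word_ids()           # word index per token, None for special tokens
token_index = fast_enc.char_to_token(4)  # token covering character 4 (the "d" in "drug")
span = fast_enc.token_to_chars(2)        # character span covered by token 2

print(slow_enc.is_fast, fast_enc.is_fast, same_ids)
print(fast_tokens, word_ids)
print(token_index, (span.start, span.end))
```

Calling the alignment methods on the slow encoding raises an error, so checking is_fast first is a reasonable guard in code that has to accept both kinds of tokenizer.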