Issue with Transformer notebook's Getting Started Tokenizers

krwin · January 29, 2021, 11:11pm

In the documentation, I went to Transformers Notebooks and clicked on the Colab for Getting Started Tokenizers. I executed each cell until I got to this one:

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

I ran the cell and got this:

TypeError: Can't convert <tokenizers.trainers.BpeTrainer object at 0x7f8641325570> to Sequence

I assume these cells are supposed to work, so something must have changed in the library without the notebook being updated. I am trying to learn transformers on my own, so where can I go to learn if the Hugging Face docs are not up to date? Any help will be appreciated.

lewtun · January 30, 2021, 11:43am

Hi @krwin, this indeed seems to be a bug in the notebook, where the order of the arguments for tokenizer.train() in this cell

from tokenizers.trainers import BpeTrainer

# We initialize our trainer, giving it the details about the vocabulary we want to generate
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

is back-to-front (see the docs). To fix the problem you can just specify the arguments explicitly:

tokenizer.train(trainer=trainer, files=["big.txt"])
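To see why naming the arguments helps, here is a minimal, library-free sketch. It assumes, as the error message suggests, that the installed release expects the file list in the first position. `FakeTrainer` and this `train` function are hypothetical stand-ins written for illustration, not the tokenizers API:

```python
from typing import List, Optional, Sequence


class FakeTrainer:
    """Hypothetical stand-in for a trainer object such as BpeTrainer."""


def train(files: Sequence[str], trainer: Optional[FakeTrainer] = None) -> List[str]:
    """Hypothetical stand-in for a train() that expects the file list first."""
    if not isinstance(files, (list, tuple)):
        raise TypeError(f"Can't convert {files!r} to Sequence")
    return list(files)


trainer = FakeTrainer()

# Positional call in the notebook's order: the trainer lands in the `files` slot.
try:
    train(trainer, ["big.txt"])
    positional_ok = True
except TypeError as err:
    positional_ok = False
    print("positional call failed:", err)

# Keyword call: each argument is bound by name, so the order no longer matters.
trained_on = train(trainer=trainer, files=["big.txt"])
print("keyword call trained on:", trained_on)
```

With keyword arguments the call keeps working whichever parameter the function lists first, which is why it is a safe choice when a notebook and the installed library version may be out of sync.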

cc: @anthony

krwin · January 30, 2021, 9:58pm

Thank you very much for helping me and being so prompt.
Hope you have a great day!
