In the documentation, I went to Transformer Notebooks and clicked on the Colab for Getting Started Tokenizer. I executed each cell in order, and one of them failed with:
TypeError: Can’t convert <tokenizers.trainers.BpeTrainer object at 0x7f8641325570> to Sequence
I assume these cells are supposed to work, so something must have changed in the library without the notebook being updated. I am trying to learn transformers on my own, so where can I go to learn if the Hugging Face docs are not up to date? Any help would be appreciated.
Hi @krwin, this indeed seems to be a bug in the notebook, where the order of the arguments for tokenizer.train() in this cell
from tokenizers.trainers import BpeTrainer
# We initialize our trainer, giving it the details about the vocabulary we want to generate
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["big.txt"])
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
is back-to-front (see the docs). To fix the problem you can just specify the arguments explicitly:
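A minimal sketch of that fix, assuming a recent tokenizers release in which Tokenizer.train() accepts the keyword arguments files and trainer. The tokenizer setup (a BPE model with a ByteLevel pre-tokenizer) and the small generated corpus are assumptions added so the snippet runs on its own; in the notebook you would keep the existing tokenizer and pass ["big.txt"] instead.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Stand-in corpus (assumption): the notebook trains on "big.txt" instead.
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("the quick brown fox jumps over the lazy dog\n" * 100)

# Assumed setup: a BPE model with byte-level pre-tokenization.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# Same trainer configuration as in the notebook cell.
trainer = BpeTrainer(
    vocab_size=25000,
    show_progress=True,
    initial_alphabet=ByteLevel.alphabet(),
)

# Naming the arguments makes the call independent of their positional order.
tokenizer.train(files=[corpus_path], trainer=trainer)
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
```

Passing the arguments by keyword avoids the TypeError because the trainer is no longer placed in the position where a sequence of file paths is expected.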