Convert slow XLMRobertaTokenizer to fast one

Posting a question I had, which @SaulLu answered :smiley: Suppose you have a repo on the Hub that only has slow tokenizer files, and you want to be able to load a fast tokenizer. Here's how to do that:

!pip install -q transformers sentencepiece

from transformers import XLMRobertaTokenizerFast

model_name = "naver-clova-ix/donut-base-finetuned-docvqa"

# from_slow=True builds the fast tokenizer from the slow (sentencepiece) files on the Hub
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name, from_slow=True)

# legacy_format=False writes the unified tokenizer.json used by fast tokenizers
tokenizer.save_pretrained("fast_tok", legacy_format=False)
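To sanity-check the result, you can reload the saved folder and confirm it now resolves to a fast tokenizer (a quick check, reusing the fast_tok folder from above):

from transformers import AutoTokenizer

reloaded = AutoTokenizer.from_pretrained("fast_tok")
print(type(reloaded).__name__)  # XLMRobertaTokenizerFast
print(reloaded.is_fast)         # True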

hey @nielsr @SaulLu, thank you for this, but what if you have a fast tokenizer and want to convert it to a slow one? I'm thinking you could somehow parse the tokenizer.json (written when saving a fast tokenizer) and create the sentencepiece model you need for the slow tokenizer from that JSON. What do you think?
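One way to prototype that idea: for an XLM-R-style (Unigram) fast tokenizer, tokenizer.json stores [piece, score] pairs, which is the same information a sentencepiece model carries. A minimal sketch of the parsing step, assuming the fast_tok folder saved above (turning this into an actual .model file would additionally require the sentencepiece protobuf definitions):

import json

# Load the serialized fast tokenizer (assumes the "fast_tok" folder from above)
with open("fast_tok/tokenizer.json") as f:
    tok = json.load(f)

# For a Unigram model, "vocab" is a list of [piece, log_prob] pairs
vocab = tok["model"]["vocab"]
print(vocab[:5])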

Hi @nielsr @ahmedlone123, I want to convert a slow tokenizer (LlamaTokenizer) to a fast one (LlamaTokenizerFast) without losing the legacy behaviour, i.e. it should not prepend a dummy prefix "_" to the sentence. It works fine with LlamaTokenizer, but when I load the same tokenizer with LlamaTokenizerFast, this behaviour is lost. How do I retain the default behaviour of LlamaTokenizer (not adding a dummy prefix) in the fast tokenizer?

Hi @abdul-r17. Both slow and fast Llama tokenizers add a dummy prefix. This can be controlled with the add_prefix_space=False parameter, which you can either set on the slow tokenizer before saving:

from transformers import LlamaTokenizer, LlamaTokenizerFast

# Disable the dummy prefix on the slow tokenizer, then save it
tokenizer = LlamaTokenizer.from_pretrained(model, add_prefix_space=False)
tokenizer.save_pretrained(temp_folder)

# The fast tokenizer picks the setting up from the saved config
fast_tokenizer = LlamaTokenizerFast.from_pretrained(temp_folder)

or set it when loading the fast tokenizer:

# Save with the default behaviour, disable the prefix only on the fast side
tokenizer = LlamaTokenizer.from_pretrained(model)
tokenizer.save_pretrained(temp_folder)
fast_tokenizer = LlamaTokenizerFast.from_pretrained(temp_folder, add_prefix_space=False)
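To see the effect, you can compare tokenizations with the prefix on and off (a quick sketch; model stands in for whichever Llama checkpoint you are using):

default = LlamaTokenizerFast.from_pretrained(model)
no_prefix = LlamaTokenizerFast.from_pretrained(model, add_prefix_space=False)

print(default.tokenize("Hello"))    # first piece starts with '▁' (dummy prefix)
print(no_prefix.tokenize("Hello"))  # no leading '▁'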

Hope that works in your use case! :smile: