Convert slow XLMRobertaTokenizer to fast one

Posting a question I had, which @SaulLu answered :smiley: Suppose you have a repo on the Hub that only has slow tokenizer files, and you want to be able to load a fast tokenizer. Here's how to do that:

!pip install -q transformers sentencepiece

from transformers import XLMRobertaTokenizerFast

model_name = "naver-clova-ix/donut-base-finetuned-docvqa"

# from_slow=True builds the fast tokenizer from the slow (sentencepiece) files on the Hub
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name, from_slow=True)

# legacy_format=False writes the unified tokenizer.json used by fast tokenizers
tokenizer.save_pretrained("fast_tok", legacy_format=False)
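To sanity-check the result, you can reload the saved folder and confirm it now resolves to a fast tokenizer (a quick check, reusing the fast_tok folder from above):

from transformers import AutoTokenizer

reloaded = AutoTokenizer.from_pretrained("fast_tok")
print(type(reloaded).__name__)  # XLMRobertaTokenizerFast
print(reloaded.is_fast)         # True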

hey @nielsr @SaulLu, thank you for this, but what if you have a fast tokenizer and want to convert it to a slow one? I'm thinking you could somehow parse the tokenizer.json (written when saving a fast tokenizer) and create the sentencepiece model you need for the slow tokenizer from that JSON. What do you think?
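One way to prototype that idea: for an XLM-R-style (Unigram) fast tokenizer, tokenizer.json stores [piece, score] pairs, which is the same information a sentencepiece model carries. A minimal sketch of the parsing step, assuming the fast_tok folder saved above (turning this into an actual .model file would additionally require the sentencepiece protobuf definitions):

import json

# Load the serialized fast tokenizer (assumes the "fast_tok" folder from above)
with open("fast_tok/tokenizer.json") as f:
    tok = json.load(f)

# For a Unigram model, "vocab" is a list of [piece, log_prob] pairs
vocab = tok["model"]["vocab"]
print(vocab[:5])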

Hi @nielsr @ahmedlone123, I want to convert a slow tokenizer (LlamaTokenizer) to a fast one (LlamaTokenizerFast) without losing the legacy behaviour, i.e. it should not prepend a dummy prefix "_" to the sentence. It works fine with LlamaTokenizer, but when I load the same tokenizer with LlamaTokenizerFast, this behaviour is lost. How do I retain the default behaviour of LlamaTokenizer (not adding a dummy prefix) in the fast tokenizer?

Hi @abdul-r17. Both slow and fast Llama tokenizers add a dummy prefix. This can be controlled with the add_prefix_space=False parameter, which you can either set on the slow tokenizer before saving:

from transformers import LlamaTokenizer, LlamaTokenizerFast

# Disable the dummy prefix on the slow tokenizer, then save it
tokenizer = LlamaTokenizer.from_pretrained(model, add_prefix_space=False)
tokenizer.save_pretrained(temp_folder)

# The fast tokenizer picks the setting up from the saved config
fast_tokenizer = LlamaTokenizerFast.from_pretrained(temp_folder)

or set it when loading the fast tokenizer:

# Save with the default behaviour, disable the prefix only on the fast side
tokenizer = LlamaTokenizer.from_pretrained(model)
tokenizer.save_pretrained(temp_folder)
fast_tokenizer = LlamaTokenizerFast.from_pretrained(temp_folder, add_prefix_space=False)
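To see the effect, you can compare tokenizations with the prefix on and off (a quick sketch; model stands in for whichever Llama checkpoint you are using):

default = LlamaTokenizerFast.from_pretrained(model)
no_prefix = LlamaTokenizerFast.from_pretrained(model, add_prefix_space=False)

print(default.tokenize("Hello"))    # first piece starts with '▁' (dummy prefix)
print(no_prefix.tokenize("Hello"))  # no leading '▁'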

Hope that works in your use case! :smile: