Posting a question I had which @SaulLu answered: suppose you have a repo on the Hub that only has slow tokenizer files, and you want to be able to load a fast tokenizer. Here's how to do that:
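A minimal sketch of the approach (the repo id is a placeholder, replace it with your own): loading with `use_fast=True` makes transformers convert the slow sentencepiece files on the fly, and saving afterwards writes the `tokenizer.json` file that a fast tokenizer needs.

```python
from transformers import AutoTokenizer

# use_fast=True (the default in recent versions) converts the slow
# sentencepiece-based files into a fast tokenizer on the fly.
tokenizer = AutoTokenizer.from_pretrained("your-username/your-repo", use_fast=True)

# Saving writes tokenizer.json, so future loads get the fast tokenizer directly.
tokenizer.save_pretrained("local-dir")
# tokenizer.push_to_hub("your-username/your-repo")  # optionally update the Hub repo
```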
hey @nielsr @SaulLu, Thank you for this, but what if you have fast tokenizers and you want to convert them to slow tokenizers? I am thinking that somehow you could parse the .json (saved by a fast tokenizer) and create the sentencepiece model you need for slow tokenizers from the JSON. What do you think?
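Something like this is what I have in mind (just a sketch of the idea, not an official conversion path; as far as I know transformers does not support rebuilding a sentencepiece `.model` file out of the box):

```python
import json

# tokenizer.json (written by save_pretrained on a fast tokenizer) is plain JSON.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

# For BPE models "vocab" is a token -> id dict; for Unigram it is a
# list of [piece, score] pairs. Either way the pieces are recoverable,
# though turning them back into a sentencepiece .model file would
# still need to be done by hand.
vocab = data["model"]["vocab"]
print(data["model"]["type"], len(vocab))
```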
Hi @nielsr @ahmedlone123, I want to convert a slow tokenizer (LlamaTokenizer) to a fast one (LlamaTokenizerFast) without losing the legacy behaviour, i.e., it should not append a dummy prefix '▁' in front of the sentence. It works fine with LlamaTokenizer, but when I load the same tokenizer using LlamaTokenizerFast, this behaviour is lost. How can I retain the default behaviour of LlamaTokenizer (not adding a dummy prefix) in the fast tokenizer?
Hi @abdul-r17. Both slow and fast Llama add a dummy prefix. This can be controlled with the add_prefix_space=False parameter, which you can set on the tokenizer before you save it or when loading it (a sketch below; the model id is a placeholder for your checkpoint):
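```python
from transformers import AutoTokenizer

# "huggyllama/llama-7b" is a placeholder; use your own checkpoint.
# In recent transformers versions, passing add_prefix_space=False makes
# the fast Llama tokenizer get rebuilt from the slow files so the
# setting actually takes effect.
tokenizer = AutoTokenizer.from_pretrained(
    "huggyllama/llama-7b",
    add_prefix_space=False,
)

print(tokenizer.tokenize("Hello world"))  # first token without a leading '▁'
```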