Tokenizer BOS behavior is inconsistent with Llama 3.1
#5
by dzhulgakov - opened
While the chat template handles this correctly, calling vanilla encode with the 3.2 tokenizers
does not prepend BOS, while the 3.1 ones do. E.g.
>>> transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct").encode("hello")
[128000, 15339]
>>> transformers.AutoTokenizer.from_pretrained("nltpt/Llama-3.2-1B-Instruct").encode("hello")
[15339]
Is this intended?
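In the meantime, a minimal workaround sketch (assuming the repo IDs above remain accessible and that 128000 is the BOS id for these tokenizers): check whether encode already puts bos_token_id first and prepend it manually if not.
>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained("nltpt/Llama-3.2-1B-Instruct")
>>> ids = tok.encode("hello")
>>> # prepend BOS manually if the tokenizer config did not add it
>>> ids = ids if ids[:1] == [tok.bos_token_id] else [tok.bos_token_id] + ids
>>> ids
[128000, 15339]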
It should be fixed after https://huggingface.co/nltpt/Llama-3.2-1B-Instruct/discussions/8
osanseviero changed discussion status to closed