Tokenization question, is HF implementation here correct?
Howdy!
I took part in getting this model to run as .gguf over here: https://github.com/ggerganov/llama.cpp/pull/6033
There was one question I had there about tokenization that I think is still unresolved: llama.cpp (as of merging that PR) and this HF model behave differently when tokenizing.
I'll use this string as an example: hello\n\nworld
In llama.cpp, this is tokenized to [34313, 2126, 17080] (34313 = hello, 2126 = '\n\n', 17080 = world).
In the Hugging Face implementation here, I see it tokenized instead as [34313, 206, 206, 17080]. Same thing, except that 206 is a single newline '\n'. So HF tokenizes the double newline '\n\n' to [206, 206] instead of the single token [2126].
I presume there would be other combinations of tokens as well that would get tokenized differently. Perhaps much more likely to happen with foreign text or emojis or unusual text? (haven't tested; but I'm guessing).
My question is: is the Hugging Face implementation here doing it correctly? Was the model trained with the same tokenization that this implementation uses? Or is the llama.cpp behavior, where \n\n is tokenized to 2126, more correct? Do you have an idea which tokenization will yield better results? It doesn't seem to matter much functionally in empirical tests, but there may be a slight difference in quality if, for example, the model never saw token 2126 in training.
I would presume llama.cpp is doing it "more correctly" because 2126 exists in the vocabulary, and presumably a token that is in the vocabulary is meant to be used. But I can't tell for sure from the outside.
Copy-pasting from the llama.cpp pull request comment (https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-2000227286): this can slightly change the logits and their ordering. Below is an example from a test with about 2200 tokens, checking the logits for the next token:
HF implementation
| Token ID | Token | Logit |
| --- | --- | --- |
| 6315 | Black | 10.5 |
| 9732 | Western | 10.2890625 |
| 7397 | Vir | 10.2421875 |
| 6842 | region | 9.953125 |
| 14664 | Eastern | 9.859375 |
| 4376 | known | 9.640625 |
| 4903 | major | 9.625 |
| 5079 | city | 9.609375 |
| 4509 | City | 9.4140625 |
| 7155 | Av | 9.3515625 |
HF implementation when used with llama.cpp tokenization (i.e. I gave the model the tokens as a list of integers from llama.cpp instead of letting it tokenize for me)
| Token ID | Token | Logit |
| --- | --- | --- |
| 6315 | Black | 10.6796875 |
| 7397 | Vir | 10.46875 |
| 9732 | Western | 10.3046875 |
| 6842 | region | 10.015625 |
| 14664 | Eastern | 9.828125 |
| 12999 | Southern | 9.5 |
| 5079 | city | 9.4609375 |
| 4903 | major | 9.40625 |
| 7155 | Av | 9.3203125 |
| 71010 | jungle | 9.2734375 |
Maybe a longer text would be affected more. In my test text of about 2200 tokens, only about 7 tokens differed, and all of them were \n\n encoded as 206, 206 versus 2126.
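The two tables above can also be compared mechanically. The sketch below hard-codes the rows from both tables and reports which tokens appear in only one top 10, plus the largest logit shift among the shared tokens. The variable and function names are made up for illustration.

```python
# Top-10 next-token logits copied from the two tables above,
# as (token id, token, logit) rows.
hf_tokenization = [
    (6315, "Black", 10.5), (9732, "Western", 10.2890625),
    (7397, "Vir", 10.2421875), (6842, "region", 9.953125),
    (14664, "Eastern", 9.859375), (4376, "known", 9.640625),
    (4903, "major", 9.625), (5079, "city", 9.609375),
    (4509, "City", 9.4140625), (7155, "Av", 9.3515625),
]
llama_cpp_tokenization = [
    (6315, "Black", 10.6796875), (7397, "Vir", 10.46875),
    (9732, "Western", 10.3046875), (6842, "region", 10.015625),
    (14664, "Eastern", 9.828125), (12999, "Southern", 9.5),
    (5079, "city", 9.4609375), (4903, "major", 9.40625),
    (7155, "Av", 9.3203125), (71010, "jungle", 9.2734375),
]

def compare_top_k(a, b):
    """Return tokens only in a, tokens only in b, and logit shifts (b - a) for shared tokens."""
    ids_a = {tid for tid, _, _ in a}
    logit_b = {tid: logit for tid, _, logit in b}
    only_a = [tok for tid, tok, _ in a if tid not in logit_b]
    only_b = [tok for tid, tok, _ in b if tid not in ids_a]
    shifts = {tok: logit_b[tid] - logit for tid, tok, logit in a if tid in logit_b}
    return only_a, only_b, shifts

only_hf, only_llama, shifts = compare_top_k(hf_tokenization, llama_cpp_tokenization)
largest = max(shifts, key=lambda tok: abs(shifts[tok]))
print("only in HF-tokenized top 10:", only_hf)
print("only in llama.cpp-tokenized top 10:", only_llama)
print("largest shift among shared tokens:", largest, shifts[largest])
```

On this data, eight of the ten tokens are shared between the two lists, and no shared logit moves by more than about 0.23.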
Maybe answering my own question here: I thought of a simple test to check this. I gave this prompt to the model:
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>
Show me newlines.
<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
And then I monitored the logits as it started generating. Token 2126 (= double newline \n\n) showed up a lot among the top logits.
If it shows up among the top logits whenever it makes sense, that tells me 2126 (double newline) was present at training time and the model is aware of it. So I'm leaning towards llama.cpp having the correct behavior.
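This kind of check amounts to counting how often a given token id lands among the k largest logits across generation steps. A minimal sketch, assuming the per-step logits have already been collected as plain lists indexed by token id. The helper name and the toy 4-token vocabulary are hypothetical and are not real model output.

```python
import heapq

def top_k_hits(step_logits, token_id, k=3):
    """Count the generation steps where token_id is among the k largest logits."""
    hits = 0
    for logits in step_logits:
        # Indices of the k largest logits at this step.
        top = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
        if token_id in top:
            hits += 1
    return hits

# Toy data: 3 generation steps over a 4-token vocabulary (made-up numbers).
toy_steps = [
    [0.1, 2.0, 1.5, -1.0],
    [3.0, 0.2, 0.1, 0.0],
    [0.0, 0.5, 4.0, 1.0],
]
print(top_k_hits(toy_steps, token_id=2, k=2))
```

With a real model, `step_logits` would hold one full-vocabulary logit vector per generated token and `token_id` would be 2126.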
If this reasoning is correct, then this discussion becomes a bug report: this model does not tokenize string prompts exactly the way training did. hello\n\nworld would be an example of a string that is tokenized wrong, as [34313, 206, 206, 17080] instead of [34313, 2126, 17080]. Generation seems fine.
If you are able to confirm if this reasoning is correct that would be sweet.
Hi @Noeda, thanks for carefully checking this. The HF tokenizer is correct. I did a quick test using this string: hello\n\n world (with a space between \n\n and world), and the tokenizer returns [5, 28339, 2126, 3845], where 5 is the bos_token. Indeed, \n\n is present during training, and the tokenizer encodes based on the text. I also double-checked with the tokenizers implementation at https://huggingface.co/Cohere/Command-nightly, and that tokenizer returns exactly the same ids. There seems to be a small difference between the tokenizers BPE and the llama.cpp BPE.
So how to get correct results from hf tokenizers?
The HF tokenizer is correct, as I showed above.
Awesome. Thanks for confirming quickly. Rechecked myself with this small program:
#!/usr/bin/env python3
from transformers import AutoTokenizer
model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
print(tokenizer.encode("hello\n\nworld")) # -> [5, 34313, 206, 206, 17080]
print(tokenizer.encode("hello\n\n world")) # -> [5, 34313, 2126, 3845]
# 5 = bos
print(repr(tokenizer.decode([206]))) # -> '\n'
print(repr(tokenizer.decode([17080]))) # -> 'world'
print(repr(tokenizer.decode([2126]))) # -> '\n\n'
print(repr(tokenizer.decode([3845]))) # -> ' world'
I had not seen the HF tokenizer produce 2126 for \n\n before; I always got 206 206 instead, and thought that might be a bug, since llama.cpp produces 2126 and that made sense. If the tokenizer is the same one used in training, and 2126 does come out when appropriate, there is no bug. Although llama.cpp and HF disagree, both tokenizations decode back to the same original string. But I might take this back to llama.cpp to see if we should modify the tokenizer in llama.cpp when we load this model.
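The round-trip claim can be illustrated without loading the model. The sketch below uses only the id-to-string pairs reported in this thread and decodes by simple concatenation. That is a simplification of a real byte-level decoder, and the names are made up for illustration.

```python
# Token strings as reported in this thread (a tiny subset of the vocabulary).
VOCAB = {
    34313: "hello",
    206: "\n",
    2126: "\n\n",
    17080: "world",
    3845: " world",
}

def decode(ids):
    """Simplified decode: concatenate the string for each token id."""
    return "".join(VOCAB[i] for i in ids)

hf_ids = [34313, 206, 206, 17080]       # HF tokenization of hello\n\nworld
llama_cpp_ids = [34313, 2126, 17080]    # llama.cpp tokenization of the same string

print(repr(decode(hf_ids)))
print(repr(decode(llama_cpp_ids)))
print(decode(hf_ids) == decode(llama_cpp_ids))
```

Both id sequences decode to the same text, so the difference is only visible to the model, which sees different input ids.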
Thanks for the model :) I've had a lot of fun with it.
I've opened this as an issue on the llama.cpp side for anyone interested: https://github.com/ggerganov/llama.cpp/issues/6104
My ETA for investigating what exactly is off is some time next week.
Thanks again, amazing work integrating our model to llama.cpp ❤️