---
library_name: transformers
tags:
- c4ai-command-r-v01
- chat-template
- cohere
---

# Chat Template Tokenizer for c4ai-command-r-v01

This repository contains a fast tokenizer for [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) with a chat template. The tokenizer was created by replacing the string values of the original tokens with ids `255000` (`<|START_OF_TURN_TOKEN|>`) and `255001` (`<|END_OF_TURN_TOKEN|>`) with the role tokens `<|SYSTEM_TOKEN|>`, `<|USER_TOKEN|>` and `<|CHATBOT_TOKEN|>`.

No new tokens were added during this process, so the original model's embedding matrix does not need to be modified.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

# Render the conversation as a string instead of token ids
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)

# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
# <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I'm doing great. How can I help you today?<|END_OF_TURN_TOKEN|>
```
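To make the turn format in the output above easier to follow, here is a minimal, hand-written sketch of how such a string could be assembled. It is an illustration only, not the Jinja template stored in this tokenizer: the `render_chat` helper, the role-to-token mapping (`system`, `user`, `assistant`) and the `separator` parameter are assumptions, and any BOS token the real template may emit is left out.

```python
# Illustrative sketch only: NOT the chat template shipped with the tokenizer.
# The role mapping below is an assumption based on the token names.
ROLE_TOKENS = {
    "system": "<|SYSTEM_TOKEN|>",
    "user": "<|USER_TOKEN|>",
    "assistant": "<|CHATBOT_TOKEN|>",
}
START = "<|START_OF_TURN_TOKEN|>"
END = "<|END_OF_TURN_TOKEN|>"


def render_chat(messages, add_generation_prompt=False, separator=""):
    """Wrap each message in start/role/end tokens and join the turns."""
    parts = [
        f"{START}{ROLE_TOKENS[m['role']]}{m['content']}{END}"
        for m in messages
    ]
    if add_generation_prompt:
        # Open a chatbot turn so a model would continue with its reply.
        parts.append(START + ROLE_TOKENS["assistant"])
    return separator.join(parts)


example = [{"role": "user", "content": "Hello, how are you?"}]
print(render_chat(example, add_generation_prompt=True))
```

With `add_generation_prompt=True` the sketch ends the string with an open chatbot turn, which is the usual purpose of that flag when prompting a model for its next reply. To see what the real template produces, call `tokenizer.apply_chat_template` as shown above.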

## Test

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
original_tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Compare the special tokens of both tokenizers
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# Check that both vocabularies have the same size
assert len(tokenizer) == len(original_tokenizer), "the tokenizers do not have the same vocabulary size"
```
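
The length check confirms that no tokens were added, but it does not show which token strings changed. A possible finer-grained check, sketched here with a hypothetical `diff_vocabs` helper and toy vocabularies (the token names in the example are made up), lists every id whose token string differs between two token-to-id mappings.

```python
def diff_vocabs(vocab_a, vocab_b):
    """Return {id: (token_a, token_b)} for ids whose token strings differ.

    Both arguments map token string -> id, the shape returned by
    `get_vocab()` on a transformers tokenizer.
    """
    by_id_a = {i: t for t, i in vocab_a.items()}
    by_id_b = {i: t for t, i in vocab_b.items()}
    return {
        i: (by_id_a.get(i), by_id_b.get(i))
        for i in sorted(by_id_a.keys() | by_id_b.keys())
        if by_id_a.get(i) != by_id_b.get(i)
    }


# Toy vocabularies (hypothetical tokens) standing in for the real ones.
original = {"hello": 0, "world": 1, "<|OLD_A|>": 2}
modified = {"hello": 0, "world": 1, "<|NEW_A|>": 2}
print(diff_vocabs(original, modified))  # {2: ('<|OLD_A|>', '<|NEW_A|>')}
```

With the two tokenizers loaded above, the same helper could be called as `diff_vocabs(original_tokenizer.get_vocab(), tokenizer.get_vocab())`. If only token strings were replaced, the result should contain just the affected ids.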