---
library_name: transformers
tags:
- c4ai-command-r-v01
- chat-template
- cohere
---
# Chat Template Tokenizer for c4ai-command-r-v01
This repository provides a fast tokenizer for [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) with the chat template. The tokenizer was created by replacing the string values of the original tokens with IDs `255000` (`<|START_OF_TURN_TOKEN|>`) and `255001` (`<|END_OF_TURN_TOKEN|>`) with the role tokens `<|SYSTEM_TOKEN|>`, `<|USER_TOKEN|>`, and `<|CHATBOT_TOKEN|>`.
No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.
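For reference, a replacement like this can be done by editing the token strings in the tokenizer files instead of adding new entries. The snippet below is a minimal, hypothetical sketch (not the exact script used for this repository), assuming the special tokens are listed in the `added_tokens` section of `tokenizer.json`:
```python
import json

# Hypothetical illustration: change the string form of existing special tokens
# in tokenizer.json without creating any new token IDs.
with open("tokenizer.json") as f:
    tok = json.load(f)

new_strings = {255000: "<|USER_TOKEN|>"}  # example mapping only
for entry in tok["added_tokens"]:
    if entry["id"] in new_strings:
        entry["content"] = new_strings[entry["id"]]

with open("tokenizer.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```
The resulting tokenizer can then be used as a drop-in replacement: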
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
messages = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
# <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I'm doing great. How can I help you today?<|END_OF_TURN_TOKEN|>
```
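For inference, the template can also append the opening of a chatbot turn so the model continues in the assistant role. This uses the standard `add_generation_prompt` option of `apply_chat_template`; the exact trailing tokens depend on the template itself:
```python
# Build a generation-ready prompt: the rendered text ends with the
# start-of-turn and chatbot role tokens, ready to pass to model.generate().
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, how are you?"}],
    add_generation_prompt=True,
    return_tensors="pt",
)
```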
## Test
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
original_tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)
# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary size"
```
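As an additional, optional check (a sketch assuming the role tokens are resolvable with `convert_tokens_to_ids`), you can verify that the role tokens map to IDs that already exist in the original vocabulary, which is what allows the embedding matrix to stay untouched:
```python
# Optional check: every role token should resolve to an ID inside the
# original vocabulary range, i.e. no new embeddings are required.
for role_token in ["<|SYSTEM_TOKEN|>", "<|USER_TOKEN|>", "<|CHATBOT_TOKEN|>"]:
    token_id = tokenizer.convert_tokens_to_ids(role_token)
    assert token_id is not None and token_id < len(original_tokenizer), (
        f"{role_token} maps outside the original vocabulary"
    )
```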