---
library_name: transformers
tags:
- c4ai-command-r-v01
- chat-template
- cohere
---

# Chat Template Tokenizer for c4ai-command-r-v01

This repository provides a fast tokenizer for [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) together with the chat template. The tokenizer was created by replacing the string values of the original tokens with ids `255000` (`<|START_OF_TURN_TOKEN|>`) and `255001` (`<|END_OF_TURN_TOKEN|>`) with the role tokens `<|SYSTEM_TOKEN|>`, `<|USER_TOKEN|>`, and `<|CHATBOT_TOKEN|>`.

No new tokens were added during this process, so the original model's embeddings do not need to be modified.
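
To see the effect of this replacement, you can inspect what ids `255000` and `255001` map to in this tokenizer versus the original one (a small inspection sketch; variable names are illustrative):

```python
from transformers import AutoTokenizer

modified = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
original = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Print the token string each special id resolves to in the original and the modified tokenizer.
for token_id in (255000, 255001):
    print(token_id, repr(original.convert_ids_to_tokens(token_id)),
          "->", repr(modified.convert_ids_to_tokens(token_id)))
```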

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")

messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)

# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
# <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I'm doing great. How can I help you today?<|END_OF_TURN_TOKEN|>

```
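
Because the vocabulary and token ids are unchanged, the tokenizer can be used directly with the original model. The following is a minimal generation sketch, assuming a `transformers` version with Command-R support and enough memory to load the model; the sampling parameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01", torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Hello, how are you?"}]

# add_generation_prompt=True appends the start of a new chatbot turn so the model answers it.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```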


## Test

```python
tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
original_tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same length"

```
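
As an additional check (a sketch that reuses the two tokenizers loaded above), a chat-formatted prompt should only produce ids that already exist in the original vocabulary, since no tokens were added:

```python
# Tokenize a chat prompt with the modified tokenizer and make sure every id
# is a valid index into the original, unchanged embedding matrix.
ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, how are you?"}],
    add_generation_prompt=True,
)
assert all(0 <= i < len(original_tokenizer) for i in ids), "found an id outside the original vocab"
print("all ids are within the original vocabulary")
```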