library_name: transformers
tags:
  - c4ai-command-r-v01
  - chat-template
  - cohere

Chat Template Tokenizer for c4ai-command-r-v01

This repository contains a fast tokenizer for CohereForAI/c4ai-command-r-v01 with a chat template. The tokenizer was created by replacing the string values of the original tokens with IDs 255000 (<|START_OF_TURN_TOKEN|>) and 255001 (<|END_OF_TURN_TOKEN|>) with the role tokens <|SYSTEM_TOKEN|>, <|USER_TOKEN|> and <|CHATBOT_TOKEN|>.

No new tokens were added in the process, so the vocabulary size is unchanged and the original model's embedding matrix does not need to be resized or modified.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")

messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

# Render the conversation as a string instead of token IDs
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(prompt)

# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
# <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I'm doing great. How can I help you today?<|END_OF_TURN_TOKEN|>
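
To see the turn structure without loading the tokenizer, the plain-Python sketch below mimics the format shown in the sample output above. The helper `render_chat` and the `ROLE_TOKENS` mapping are hypothetical names introduced for illustration. The real template is a Jinja template stored with the tokenizer and may differ in details such as BOS handling or whitespace, so treat this as an approximation, not the actual implementation.

```python
# Hypothetical illustration of the turn format; not the tokenizer's actual Jinja template.
START = "<|START_OF_TURN_TOKEN|>"
END = "<|END_OF_TURN_TOKEN|>"

# Assumed mapping from chat roles to the role tokens named in this README.
ROLE_TOKENS = {
    "system": "<|SYSTEM_TOKEN|>",
    "user": "<|USER_TOKEN|>",
    "assistant": "<|CHATBOT_TOKEN|>",
}


def render_chat(messages, add_generation_prompt=False):
    """Concatenate messages into one prompt string, one wrapped turn per message."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]  # raises KeyError on unknown roles
        parts.append(f"{START}{role_token}{message['content']}{END}")
    if add_generation_prompt:
        # Open a chatbot turn so that the model continues from this point.
        parts.append(f"{START}{ROLE_TOKENS['assistant']}")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello, how are you?"},
]
print(render_chat(messages, add_generation_prompt=True))
```

With `add_generation_prompt=True` the string ends in an open chatbot turn, which is the form typically wanted at inference time. With `False`, every turn is closed, which suits formatting complete conversations.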

Test

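from transformers import AutoTokenizer

# Load the chat-template tokenizer and the original tokenizer for comparison
tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")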
original_tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary size"
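
The length check only confirms that no tokens were added. A possible follow-up, sketched here under assumptions, is to list which IDs map to different strings in the two vocabularies. The helper `changed_tokens` is a hypothetical name; it operates on plain `{token: id}` dictionaries such as those returned by `tokenizer.get_vocab()`, and the toy vocabularies are made up so that the snippet runs without downloading anything.

```python
def changed_tokens(vocab_a, vocab_b):
    """Return {id: (token_in_a, token_in_b)} for IDs whose token string differs."""
    # Invert {token: id} into {id: token} so that the comparison is keyed by ID.
    by_id_a = {idx: tok for tok, idx in vocab_a.items()}
    by_id_b = {idx: tok for tok, idx in vocab_b.items()}
    return {
        idx: (by_id_a.get(idx), by_id_b.get(idx))
        for idx in sorted(set(by_id_a) | set(by_id_b))
        if by_id_a.get(idx) != by_id_b.get(idx)
    }


# Toy vocabularies (made up for illustration, not the real Command R vocabulary).
original_vocab = {"hello": 0, "world": 1, "<old_token>": 2}
modified_vocab = {"hello": 0, "world": 1, "<new_token>": 2}

print(changed_tokens(original_vocab, modified_vocab))
# {2: ('<old_token>', '<new_token>')}
```

With the tokenizers loaded above, the equivalent call would be `changed_tokens(original_tokenizer.get_vocab(), tokenizer.get_vocab())`.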