	---
	library_name: transformers
	tags:
	- c4ai-command-r-v01
	- chat-template
	- cohere
	---

	# Chat Template Tokenizer for c4ai-command-r-v01

	This repository includes a fast tokenizer for [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) with a chat template. The tokenizer was created by replacing the string values of the original tokens with IDs `255000` (`<|START_OF_TURN_TOKEN|>`) and `255001` (`<|END_OF_TURN_TOKEN|>`) with the role tokens `<|SYSTEM_TOKEN|>`, `<|USER_TOKEN|>` and `<|CHATBOT_TOKEN|>`.

	No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")

	messages = [
	    {"role": "user", "content": "Hello, how are you?"},
	    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
	]

	# Render the conversation as a string using the tokenizer's chat template
	prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
	print(prompt)

	# The output has roughly this form:
	# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
	# <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I'm doing great. How can I help you today?<|END_OF_TURN_TOKEN|>

	```
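
To make the layout of the rendered prompt easier to see, here is a minimal pure-Python sketch that mimics the format shown in the example output. The `render_chat` helper and the `ROLE_TOKENS` mapping are hypothetical names used only for illustration; they are not part of `transformers`. The real template is stored in the tokenizer configuration and may differ in details such as a leading BOS token or the separator between turns, which this sketch assumes to be empty.

```python
# Hypothetical helper for illustration only; not part of transformers.
# The role-to-token mapping is assumed from the example output in this README.
ROLE_TOKENS = {
    "system": "<|SYSTEM_TOKEN|>",
    "user": "<|USER_TOKEN|>",
    "assistant": "<|CHATBOT_TOKEN|>",
}


def render_chat(messages, add_generation_prompt=False):
    """Wrap each message in turn tokens, prefixed by its role token."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]
        parts.append(
            f"<|START_OF_TURN_TOKEN|>{role_token}{message['content']}<|END_OF_TURN_TOKEN|>"
        )
    if add_generation_prompt:
        # Assumed behaviour: open a chatbot turn for the model to complete.
        parts.append("<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")
    # Assumption: turns are concatenated without a separator.
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
print(render_chat(messages))
```

For actual inference, prefer `tokenizer.apply_chat_template`, since it always reflects the template shipped with the tokenizer.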


	## Test

	```python
	from transformers import AutoTokenizer

	# Load the chat-template tokenizer and the original tokenizer for comparison
	tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
	original_tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

	# Compare the special tokens of both tokenizers
	print(tokenizer.special_tokens_map)
	print(original_tokenizer.special_tokens_map)

	# Both vocabularies should have the same size, since no tokens were added
	assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary size"

	```
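
Equal vocabulary sizes do not show which token strings were changed. As a possible extension of the test, the sketch below compares two vocabularies by ID and reports every ID whose string differs. The `changed_tokens` helper is a hypothetical name introduced here, and the toy dictionaries stand in for the `{token: id}` mappings returned by `tokenizer.get_vocab()`.

```python
def changed_tokens(original_vocab, modified_vocab):
    """Return {token_id: (original_string, modified_string)} for IDs whose string differs."""
    # Invert the {token: id} mappings so tokens can be compared by ID
    original_by_id = {idx: tok for tok, idx in original_vocab.items()}
    modified_by_id = {idx: tok for tok, idx in modified_vocab.items()}
    return {
        idx: (original_by_id.get(idx), modified_by_id.get(idx))
        for idx in sorted(set(original_by_id) | set(modified_by_id))
        if original_by_id.get(idx) != modified_by_id.get(idx)
    }


# Toy vocabularies standing in for tokenizer.get_vocab() results
original = {"hello": 0, "<old_a>": 1, "<old_b>": 2}
modified = {"hello": 0, "<new_a>": 1, "<old_b>": 2}
print(changed_tokens(original, modified))  # {1: ('<old_a>', '<new_a>')}
```

With the tokenizers loaded as in the test, the call would be `changed_tokens(original_tokenizer.get_vocab(), tokenizer.get_vocab())`. An ID that maps to `None` on the original side would indicate an added token.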