Kannada Tokenizer
This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the translated_output column from the Cognitive-Lab/Kannada-Instruct-dataset. It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
Model Description
- Model Type: Byte-Pair Encoding (BPE) Tokenizer
- Language: Kannada (kn)
- Vocabulary Size: 32,000
- Special Tokens: [UNK] (unknown token), [PAD] (padding token), [CLS] (classifier token), [SEP] (separator token), [MASK] (masking token)
- License: MIT License
- Dataset Used: Cognitive-Lab/Kannada-Instruct-dataset
- Algorithm: Byte-Pair Encoding (BPE)
Intended Use
This tokenizer is intended for NLP applications involving the Kannada language, such as:
- Language Modeling
- Text Generation
- Text Classification
- Machine Translation
- Named Entity Recognition
- Question Answering
- Summarization
How to Use
You can load the tokenizer directly from the Hugging Face Hub:
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")
# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
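# encode() returns a list of integer token IDs
encoding = tokenizer.encode(text)
# Map the IDs back to their token strings, then back to plain text
tokens = tokenizer.convert_ids_to_tokens(encoding)
decoded_text = tokenizer.decode(encoding)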
print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
Output:
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Training Data
The tokenizer was trained on the translated_output column from the Cognitive-Lab/Kannada-Instruct-dataset. This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
- Dataset Size: The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
- Data Preprocessing: Text was normalized with Unicode NFKC to standardize characters.
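To illustrate what NFKC normalization does to input text, here is a minimal sketch using Python's standard unicodedata module. The helper name normalize_text is hypothetical and this is not the actual preprocessing script; it only shows that canonically equivalent spellings of a Kannada syllable, and compatibility characters such as full-width digits, end up with a single representation.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization (hypothetical helper for illustration)."""
    return unicodedata.normalize("NFKC", text)


# The same Kannada syllable written two ways:
composed = "\u0CA8\u0CC0"          # NA + VOWEL SIGN II (one code point for the vowel)
decomposed = "\u0CA8\u0CBF\u0CD5"  # NA + VOWEL SIGN I + LENGTH MARK (two-part vowel)

# The raw strings differ, but they normalize to the same sequence.
print(composed == decomposed)
print(normalize_text(composed) == normalize_text(decomposed))

# Compatibility characters such as full-width digits map to their plain forms.
print(normalize_text("\uFF11\uFF12\uFF13"))
```

Without this step, the two spellings above would be seen by the tokenizer as different character sequences and could receive different tokens.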
Training Procedure
- Normalization: NFKC normalization (compatibility decomposition followed by canonical composition) was applied so that equivalent character sequences are represented consistently.
- Pre-tokenization: The text was pre-tokenized using whitespace splitting.
- Tokenizer Algorithm: Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
- Vocabulary Size: Set to 32,000 to balance coverage and efficiency.
- Special Tokens: Included [UNK], [PAD], [CLS], [SEP], and [MASK] to support various downstream tasks.
- Training Library: The tokenizer was built using the Hugging Face Tokenizers library.
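For readers unfamiliar with how BPE builds its vocabulary, the following is a toy sketch of the merge-learning loop in plain Python. It is not the Hugging Face Tokenizers implementation and was not used to train this tokenizer; the function name learn_bpe_merges is hypothetical, symbols are individual code points, and ties are broken by first occurrence, all of which are simplifying assumptions.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, stored with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Tiny illustrative corpus (already split on whitespace).
print(learn_bpe_merges(["aab", "aab", "ab"], num_merges=5))
# [('a', 'b'), ('a', 'ab')]
```

In a real training run the loop continues until the vocabulary reaches the target size (32,000 here), and frequent Kannada character sequences such as common suffixes become single tokens.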
Evaluation
The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative evaluation metrics such as tokenization efficiency or perplexity were not computed.
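One common way to quantify tokenization efficiency is fertility, the average number of tokens produced per whitespace-separated word. The sketch below is an assumption about how such a measurement could be set up, not something that was run for this tokenizer; the function names are hypothetical and char_tokenize is only a stand-in so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words to evaluate")
    return total_tokens / total_words


# Stand-in tokenizer for demonstration only: one token per character.
def char_tokenize(text: str) -> List[str]:
    return [ch for word in text.split() for ch in word]


print(fertility(["ab cd", "efg"], char_tokenize))
```

To measure the real tokenizer, pass the tokenize method of the loaded PreTrainedTokenizerFast object in place of char_tokenize and supply held-out Kannada sentences. Lower values mean fewer tokens per word.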
Limitations
- Vocabulary Coverage: While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
- Biases: The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- Out-of-Vocabulary Words: Out-of-vocabulary words may be broken into subword tokens or mapped to the [UNK] token, which could affect performance in downstream tasks.
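To check how often text from a given domain falls back to [UNK], one option is to count unknown tokens in tokenized samples. This is a minimal sketch under assumptions: unk_fraction is a hypothetical helper, and the token lists below are made up for illustration rather than real output of this tokenizer.

```python
from typing import Iterable, List


def unk_fraction(token_lists: Iterable[List[str]], unk_token: str = "[UNK]") -> float:
    """Fraction of all tokens that equal the unknown token."""
    total = 0
    unknown = 0
    for tokens in token_lists:
        total += len(tokens)
        unknown += sum(1 for token in tokens if token == unk_token)
    return unknown / total if total else 0.0


# Made-up token lists, for illustration only.
sample = [["ನೀವು", "[UNK]", "?"], ["[UNK]"]]
print(unk_fraction(sample))  # 0.5
```

A high fraction on your own data suggests the vocabulary does not cover that domain well.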
Ethical Considerations
- Data Privacy: The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- Bias Mitigation: No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.
Recommendations
- Fine-tuning: For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
- Evaluation: Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
Acknowledgments
- Dataset: Thanks to Cognitive-Lab for providing the Kannada-Instruct-dataset.
- Libraries: Hugging Face Transformers and Hugging Face Tokenizers.
License
This tokenizer is released under the MIT License.
Citation
If you use this tokenizer in your research or applications, please consider citing it:
@misc{kannada_tokenizer_2023,
title={Kannada Tokenizer},
author={charanhu},
year={2023},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}