Turkish WordPiece Tokenizer
This repository contains a WordPiece tokenizer trained on 1 billion Turkish sentences for natural language processing (NLP) tasks in Turkish. It was built with the tokenizers library and comes in both cased and uncased versions.
Repository Structure
File Name | Description |
---|---|
special_tokens_map.json | Maps special tokens like [UNK], [PAD], [CLS], and [SEP] to their respective identifiers. |
tokenizer_config.json | Contains configuration details for the tokenizer, including model type and special token settings. |
turkish_wordpiece_tokenizer.json | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased). |
turkish_wordpiece_tokenizer_uncased.json | The uncased version of the WordPiece tokenizer. |
turkish_wordpiece_tokenizer_post_token_uncased.json | The post-tokenization configuration for the uncased tokenizer. |
Features
- WordPiece Tokenization: Breaks words into subword units for better handling of rare or unseen words.
- Support for Cased and Uncased Text: Includes separate tokenizers that preserve case (cased) or ignore it (uncased).
- Optimized for Turkish: Trained on a large-scale Turkish dataset (1 billion sentences), ensuring strong coverage of Turkish vocabulary and grammar.
- Special Tokens: Includes commonly used tokens such as:
  - [UNK] (unknown token)
  - [PAD] (padding token)
  - [CLS] (classification token)
  - [SEP] (separator token)
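As a rough illustration of how WordPiece breaks a word into subword units, the sketch below applies greedy longest-match-first splitting in plain Python. The toy vocabulary, the `wordpiece_tokenize` helper, and the `##` continuation prefix (the BERT convention) are assumptions for illustration. They are not taken from this tokenizer's files, and the real vocabulary will split words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not the repository's actual vocabulary.
toy_vocab = {"ev", "##ler", "##imiz", "##de", "kitap", "##lar"}

print(wordpiece_tokenize("evlerimizde", toy_vocab))
# ['ev', '##ler', '##imiz', '##de']
```

With this toy vocabulary, a word that has no matching pieces collapses to the unknown token: `wordpiece_tokenize("masa", toy_vocab)` returns `["[UNK]"]`.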
Usage
To use the tokenizer, load it with the Hugging Face transformers library or the tokenizers library.
Loading with tokenizers:
```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```
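Loading with transformers can be sketched in a similar way. The example below wraps a tokenizer JSON file in `PreTrainedTokenizerFast`. The `load_fast_tokenizer` helper is hypothetical, and passing the four special tokens explicitly is an assumption based on the tokens listed in this README, not a confirmed loading recipe from the author.

```python
def load_fast_tokenizer(tokenizer_file: str):
    """Wrap a tokenizers JSON file in a transformers fast tokenizer."""
    # Imported here so the helper can be defined even where transformers
    # is not installed; the import only happens when the helper is called.
    from transformers import PreTrainedTokenizerFast

    # Special tokens are passed explicitly (assumption: they match the
    # tokens listed in this README).
    return PreTrainedTokenizerFast(
        tokenizer_file=tokenizer_file,
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
    )
```

For example, `load_fast_tokenizer("path/to/turkish_wordpiece_tokenizer_uncased.json")` should return an object that can be called as `tokenizer("Merhaba dünya!")`, provided the transformers library is installed and the path points to one of the tokenizer files in this repository.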
Tokenizer Training Details
- Dataset: 1 billion Turkish sentences, sourced from diverse domains (news, social media, literature, etc.).
- Model: WordPiece tokenizer, trained with a vocabulary size suitable for the Turkish language.
- Uncased Variant: Lowercases all text during tokenization to ignore case distinctions.
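Lowercasing needs some care in Turkish, because the language distinguishes dotted İ/i from dotless I/ı. The sketch below contrasts Python's default `str.lower()` with a Turkish-aware variant. The `turkish_lower` helper is hypothetical, and the source does not say which behavior the uncased tokenizer's normalizer follows, so treat this only as a way to check your inputs against it.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless I mapping."""
    # NFC composes "I" + combining dot above into the single character "İ".
    text = unicodedata.normalize("NFC", text)
    # Map the two Turkish-specific capitals first, then lowercase the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


print("ISPARTA".lower())          # isparta (default: dotless I becomes dotted i)
print(turkish_lower("ISPARTA"))   # ısparta
print(turkish_lower("İSTANBUL"))  # istanbul
```

Note that the default `"İ".lower()` yields two code points (`i` followed by a combining dot above), which may tokenize differently from a plain `i`.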
Applications
- Text Classification
- Machine Translation
- Question Answering
- Text Summarization
- Named Entity Recognition (NER)
Citation
If you use this tokenizer in your research or applications, please cite it as follows:
@misc{turkish_wordpiece_tokenizer,
title={Turkish WordPiece Tokenizer},
author={Mert Cobanov},
year={2024},
url={https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}
Contributions
Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.