Turkish WordPiece Tokenizer

This repository contains a WordPiece tokenizer trained on 1 billion Turkish sentences for natural language processing (NLP) tasks in Turkish. It was built with the tokenizers library and is provided in both cased and uncased versions.

Repository Structure

File Name                                            Description
special_tokens_map.json                              Maps special tokens like [UNK], [PAD], [CLS], and [SEP] to their respective identifiers.
tokenizer_config.json                                Contains configuration details for the tokenizer, including model type and special token settings.
turkish_wordpiece_tokenizer.json                     The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased).
turkish_wordpiece_tokenizer_uncased.json             The uncased version of the WordPiece tokenizer.
turkish_wordpiece_tokenizer_post_token_uncased.json  The post-tokenization configuration for the uncased tokenizer.

Features

  • WordPiece Tokenization: Breaks words into subword units for better handling of rare or unseen words.
  • Support for Cased and Uncased Text: Separate tokenizers are provided, one that preserves case and one that ignores it.
  • Optimized for Turkish: Trained on a large-scale Turkish dataset (1 billion sentences) for broad coverage of Turkish vocabulary and word forms.
  • Special Tokens: Includes commonly used tokens such as:
    • [UNK] (unknown token)
    • [PAD] (padding token)
    • [CLS] (classification token)
    • [SEP] (separator token)
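
To make the subword splitting concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece tokenizers use at inference time. The toy vocabulary is hypothetical and is not taken from this tokenizer's files, and the "##" continuation prefix is the common WordPiece convention, assumed here rather than confirmed for this repository.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"kitap", "ev", "##lar", "##ler", "##da", "##de", "[UNK]"}

print(wordpiece_tokenize("kitaplar", toy_vocab))  # ['kitap', '##lar']
print(wordpiece_tokenize("evlerde", toy_vocab))   # ['ev', '##ler', '##de']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

Because Turkish attaches suffixes to stems, a vocabulary containing common stems and suffix pieces lets rare inflected forms be represented without falling back to [UNK].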

Usage

To use the tokenizer, you can load it with the Hugging Face transformers library or the tokenizers library.

Loading with tokenizers:

from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
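
Loading with transformers:

The following is a sketch of one way to wrap a tokenizer JSON file with the transformers library, using PreTrainedTokenizerFast and its tokenizer_file argument. The helper name load_fast_tokenizer is hypothetical, and passing the special tokens explicitly is an assumption in case the JSON file alone does not register them.

```python
from pathlib import Path

# Special tokens listed in this README, passed explicitly (an assumption:
# the JSON file by itself may not tell transformers which token plays which role).
SPECIAL_TOKENS = {
    "unk_token": "[UNK]",
    "pad_token": "[PAD]",
    "cls_token": "[CLS]",
    "sep_token": "[SEP]",
}


def load_fast_tokenizer(tokenizer_file):
    """Wrap a tokenizers JSON file in a transformers fast tokenizer (hypothetical helper)."""
    path = Path(tokenizer_file)
    if not path.is_file():
        raise FileNotFoundError(f"Tokenizer file not found: {path}")
    # Imported lazily so the path check above works without transformers installed.
    from transformers import PreTrainedTokenizerFast

    return PreTrainedTokenizerFast(tokenizer_file=str(path), **SPECIAL_TOKENS)
```

With a downloaded file, load_fast_tokenizer("path/to/turkish_wordpiece_tokenizer_uncased.json") returns an object that can be called on text, for example tokenizer("Merhaba dünya!"), to obtain input IDs.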

Tokenizer Training Details

  • Dataset: 1 billion Turkish sentences, sourced from diverse domains (news, social media, literature, etc.).
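  • Model: WordPiece tokenizer, trained with a vocabulary size chosen for the Turkish language. The exact size is not stated here; it can be read from a loaded tokenizer with tokenizer.get_vocab_size().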
  • Uncased Variant: Lowercases all text during tokenization to ignore case distinctions.
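
Lowercasing Turkish text has a known pitfall: the language distinguishes dotted İ/i from dotless I/ı, and generic lowercasing maps "I" to "i". This README does not say how the uncased tokenizer's normalizer treats these letters, so the sketch below is only an optional pre-processing step under that uncertainty. The function name turkish_lower is hypothetical.

```python
def turkish_lower(text):
    """Lowercase text using Turkish rules for dotted and dotless I."""
    # Map the two uppercase I variants explicitly, then lowercase the rest.
    # U+0130 is dotted capital İ; U+0131 is dotless small ı.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print("IŞIK".lower())             # işik  (generic rule, wrong for Turkish)
print(turkish_lower("IŞIK"))      # ışık
print(turkish_lower("İSTANBUL"))  # istanbul
```

Comparing the tokens produced for "IŞIK" and "ışık" by the uncased tokenizer is a quick way to check whether such pre-processing is needed.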

Applications

  • Text Classification
  • Machine Translation
  • Question Answering
  • Text Summarization
  • Named Entity Recognition (NER)

Citation

If you use this tokenizer in your research or applications, please cite it as follows:

@misc{turkish_wordpiece_tokenizer,
  title={Turkish WordPiece Tokenizer},
  author={Mert Cobanov},
  year={2024},
  url={https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}

Contributions

Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.
