---
license: mit
---

# TS-Corpus BPE Tokenizer (32k, Cased)
|
|
|
## Overview

This repository hosts a Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 32,000, trained cased on several datasets from the TS Corpus website. BPE is particularly effective for morphologically rich languages like Turkish, striking a balance between word-level and character-level tokenization.
|
|
|
## Dataset Sources

The tokenizer was trained on a variety of text sources from TS Corpus, ensuring broad linguistic coverage. These sources include:
|
- [TS Corpus V2](https://tscorpus.com/corpora/ts-corpus-v2/)
- [TS Wikipedia Corpus](https://tscorpus.com/corpora/ts-wikipedia-corpus/)
- [TS Abstract Corpus](https://tscorpus.com/corpora/ts-abstract-corpus/)
- [TS Idioms and Proverbs Corpus](https://tscorpus.com/corpora/ts-idioms-and-proverbs-corpus/)
- [Syllable Corpus](https://tscorpus.com/corpora/syllable-corpus/)
- [Turkish Constitution Corpus](https://tscorpus.com/corpora/turkish-constitution-corpus/)
|
|
|
The inclusion of idiomatic expressions, proverbs, and legal terminology provides a comprehensive toolkit for processing Turkish text across different domains.
|
|
|
## Tokenizer Model

This tokenizer uses the Byte Pair Encoding (BPE) method, which represents words as sequences of subword units and keeps the vocabulary compact. BPE is especially well suited to the agglutinative nature of Turkish, where a single stem can carry multiple suffixes.
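
For instance, a morphologically complex Turkish word is typically split into a few subword pieces rather than into individual characters. The sketch below is illustrative only: it loads the tokenizer as in the Usage section, and the actual split depends on the merges learned during training.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tahaenesaslanturk/ts-corpus-bpe-32k-cased")

# "evlerimizden" ("from our houses") stacks plural, possessive, and
# ablative suffixes onto the stem "ev" ("house"). A BPE vocabulary
# usually covers such a word with a few subword pieces; the exact
# pieces depend on the merges learned during training.
print(tokenizer.tokenize("evlerimizden"))
```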
|
|
|
## Usage

To use this tokenizer in your projects, load it with the Hugging Face `transformers` library:
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tahaenesaslanturk/ts-corpus-bpe-32k-cased")
```
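
Once loaded, the tokenizer follows the standard `transformers` API. Continuing from the snippet above, a sentence can be encoded into token IDs and decoded back; the sample sentence is arbitrary, and the IDs will depend on the learned vocabulary.

```python
# Encode a sample sentence to token IDs, then decode the IDs back to text.
encoded = tokenizer("Merhaba dünya!")   # arbitrary sample sentence
print(encoded["input_ids"])             # IDs depend on the learned vocabulary
print(tokenizer.decode(encoded["input_ids"]))
```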