tahaenesaslanturk committed on
Commit 06b1138 · verified · 1 Parent(s): 6cdb8d0

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: mit
 ---
-# TS-Corpus BPE Tokenizer (32k, Uncased)
+# TS-Corpus BPE Tokenizer (32k, Cased)
 
 ## Overview
 This repository hosts a Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 32,000, trained uncased using several datasets from the TS Corpus website. The BPE method is particularly effective for languages like Turkish, providing a balance between word-level and character-level tokenization.
@@ -18,11 +18,11 @@ The tokenizer was trained on a variety of text sources from TS Corpus, ensuring
 The inclusion of idiomatic expressions, proverbs, and legal terminology provides a comprehensive toolkit for processing Turkish text across different domains.
 
 ## Tokenizer Model
-Utilizing the Byte Pair Encoding (BPE) method, this tokenizer excels in efficiently managing subword units without the need for an extensive vocabulary. BPE is especially suitable for handling the agglutinative nature of Turkish, where words can have multiple suffixes. This uncased version normalizes input text by converting all characters to lowercase, which simplifies processing and improves consistency.
+Utilizing the Byte Pair Encoding (BPE) method, this tokenizer excels in efficiently managing subword units without the need for an extensive vocabulary. BPE is especially suitable for handling the agglutinative nature of Turkish, where words can have multiple suffixes.
 
 ## Usage
 To use this tokenizer in your projects, load it with the Hugging Face `transformers` library:
 ```python
 from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("tahaenesaslanturk/ts-corpus-bpe-32k-uncased")
+tokenizer = AutoTokenizer.from_pretrained("tahaenesaslanturk/ts-corpus-bpe-32k-cased")
 ```
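
The README's claim that BPE balances word-level and character-level units on agglutinative words can be illustrated with a toy, pure-Python merge loop in the style of Sennrich et al.'s BPE. This is a minimal sketch on invented Turkish word counts, not the training code behind this tokenizer, and the function and variable names here are hypothetical:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy sketch)."""
    # Start with each word as a tuple of single characters.
    vocab = Counter({tuple(w): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy counts for "ev" (house) with stacked suffixes; in this run the
# shared stem "evler" emerges as a single subword after four merges.
words = {"evler": 5, "evlerim": 3, "evlerimiz": 2, "ev": 8}
merges, vocab = bpe_merges(words, 4)
```

With a real 32k vocabulary the same mechanism keeps frequent stems and suffixes as whole units while still falling back to characters for rare words.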