Sigurdur
/

icebreaker-tokenicer

Sigurdur committed on Mar 8, 2024

Commit 9d3ccf4 (verified) · 1 Parent(s): 9fa2012

Create README.md

Files changed (1)

README.md +40 -0

README.md ADDED

	@@ -0,0 +1,40 @@

+---
+language:
+- is
+library_name: transformers
+---
+# Icebreaker tokenizer
+This is a byte-pair encoding (BPE) tokenizer trained on the Icelandic Gigaword Corpus (IGC), News 1. It can be used for training Icelandic language models.
+## Model Details
+A BPE tokenizer trained on the first 242553 files of the unannotated News 1 dataset of IGC 2022 by Arnastofnun.
+### Model Description
+The tokenizer has a vocabulary size of 3200.
+- **Developed by:** Sigurdur Haukur Birgisson
+- **Model type:** GPT2Tokenizer
+- **Language(s) (NLP):** Icelandic
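To make the terms "BPE" and "vocabulary size" above more concrete, here is a minimal, character-level sketch of how BPE merge rules are learned. It is an illustration only, not the training code used for this tokenizer: the function `learn_bpe_merges` and the mini-corpus are hypothetical, and a GPT-2 style tokenizer operates on bytes rather than characters.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy example)."""
    # Represent each distinct word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical mini-corpus of Icelandic words, for illustration only.
corpus = ["heimur", "heima", "heim", "halló"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

Each learned merge adds one symbol to the vocabulary, so in a scheme like this the final vocabulary size is roughly the number of base symbols plus the number of merges (plus any special tokens).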
+### Model Sources
+- **Repository:** https://github.com/sigurdurhaukur/tokenicer
+## How to Get Started with the Model
+Use the code below to get started with the model.
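+```python
+from transformers import AutoTokenizer
+
+# Load the tokenizer from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
+# Encode a sample sentence; the result holds the token ids and attention mask
+tokens = tokenizer("Halló heimur!")
+```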
+## Model Card Contact
+Sigurdur Haukur Birgisson: [email protected]