---
language:
- is
library_name: transformers
---
# Icebreaker tokenizer
This is a BPE tokenizer trained on the Icelandic Gigaword Corpus, News 1. The tokenizer can be used for training Icelandic language models.
## Model Details
A BPE tokenizer trained on the first 242,553 files of the News 1 IGC 2022 unannotated dataset by Arnastofnun.
### Model Description
The tokenizer has a vocabulary size of 3,200.
- Developed by: Sigurdur Haukur Birgisson
- Model type: GPT2Tokenizer
- Language(s) (NLP): Icelandic
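The training setup described above (byte-level BPE, vocabulary size 3,200) can be sketched with the Hugging Face `tokenizers` library. This is a minimal, hypothetical reproduction: the toy corpus below stands in for the IGC News 1 files, and the special token is an assumption based on the GPT-2 convention, not taken from this repository.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus standing in for the IGC News 1 files (hypothetical sentences)
corpus = [
    "Halló heimur!",
    "Þetta er lítið dæmi um íslenskan texta.",
]

# Byte-level BPE, as used by GPT2Tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Target the card's vocabulary size; a toy corpus will stop well short of 3,200
trainer = trainers.BpeTrainer(vocab_size=3200, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Halló heimur!")
print(tokenizer.get_vocab_size(), encoding.tokens)
```

On a real corpus the trainer would run merges until the 3,200-entry vocabulary is filled; here it simply exhausts the toy data.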
### Model Sources
- Repository: https://github.com/sigurdurhaukur/tokenicer
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
tokens = tokenizer("Halló heimur!")
```
## Model Card Contact
Sigurdur Haukur Birgisson: [email protected]