Sigurdur committed on
Commit
9d3ccf4
verified
1 Parent(s): 9fa2012

Create README.md

Files changed (1)
  1. README.md +40 -0
README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ language:
+ - is
+ library_name: transformers
+ ---
+ # Icebreaker tokenizer
+
+ This is a byte-pair encoding (BPE) tokenizer trained on the Icelandic Gigaword Corpus (IGC), News 1. It can be used to train Icelandic language models.
+
+ ## Model Details
+
+ BPE tokenizer trained on the first 242,553 files of News 1 in the unannotated IGC 2022 dataset by Arnastofnun.
+
+ ### Model Description
+
+ The tokenizer has a vocabulary size of 3,200.
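
For readers unfamiliar with BPE, the following is a minimal, character-level sketch of how merge rules are learned and applied. It is an illustration only, not the training code used for this tokenizer: the function names `learn_bpe`, `merge_pair`, and `encode` and the four-word corpus are made up for the example. GPT-2-style tokenizers work on bytes rather than characters and add special tokens, so the exact number of merges behind the 3,200-entry vocabulary is not stated here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, stored with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; on a tie, max() keeps the first pair seen.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word before the next round.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["hestur", "hestar", "hesti", "heimur"], num_merges=3)
print(merges)
print(encode("hestur", merges))
```

In this toy setup the vocabulary is the starting alphabet plus one new symbol per merge, which is why the vocabulary size acts as the main training budget for a BPE tokenizer.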
+
+
+ - **Developed by:** Sigurdur Haukur Birgisson
+ - **Model type:** GPT2Tokenizer
+ - **Language(s) (NLP):** Icelandic
+
+ ### Model Sources
+
+ - **Repository:** https://github.com/sigurdurhaukur/tokenicer
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
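+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer from the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
+ # Encode a sample sentence ("Hello world!" in Icelandic).
+ tokens = tokenizer("Halló heimur!")
+ ```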
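
To see what the tokenizer actually produced, a small helper along these lines may be useful. `describe_encoding` is a hypothetical name, not part of `transformers`. It only assumes that the tokenizer object is callable and returns `input_ids`, and that it offers `convert_ids_to_tokens` and `decode`, as Hugging Face tokenizers do.

```python
def describe_encoding(tokenizer, text):
    """Return the ids, token strings, and round-tripped text for `text`.

    `tokenizer` is assumed to behave like a Hugging Face tokenizer.
    """
    ids = list(tokenizer(text)["input_ids"])
    return {
        "ids": ids,
        "tokens": tokenizer.convert_ids_to_tokens(ids),
        "decoded": tokenizer.decode(ids),
        "n_tokens": len(ids),
    }


# Usage with the tokenizer loaded above (requires network access):
# info = describe_encoding(tokenizer, "Halló heimur!")
# print(info["tokens"], info["n_tokens"])
```

With a GPT-2-style byte-level tokenizer, the token strings are expected to show byte-level stand-ins for spaces and non-ASCII letters such as `ó`, while `decoded` should read as the original sentence.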
+
+ ## Model Card Contact
+
+ Sigurdur Haukur Birgisson: [email protected]