yukiakai sheonhan committed
Commit f6c211f · verified · 0 Parent(s)

Duplicate from facebook/fasttext-language-identification

Co-authored-by: Sheon Han <[email protected]>

Files changed (3)
  1. .gitattributes +34 -0
  2. README.md +141 -0
  3. model.bin +3 -0
.gitattributes ADDED
@@ -0,0 +1,34 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
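
The patterns above decide which files Git stores as LFS pointers rather than as regular blobs. As a rough illustration, the sketch below approximates that matching with Python's `fnmatch`. It is only a stand-in under stated assumptions, not Git's actual attribute matcher (which has its own rules for `**` and directory separators), and `is_lfs_tracked` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A few of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = ["*.bin", "*.ftz", "*.safetensors", "*.tar.*", "saved_model/**/*", "*tfevents*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: would `path` match one of the LFS patterns?

    Patterns without a slash are compared against the file name only,
    patterns with a slash against the whole relative path. This mimics,
    but does not fully reproduce, gitattributes matching.
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


# The three files touched by this commit.
for candidate in ["model.bin", "README.md", ".gitattributes"]:
    print(candidate, is_lfs_tracked(candidate))
```

With these inputs only `model.bin` is reported as tracked, which is consistent with the three-line pointer shown for it in this commit.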
README.md ADDED
@@ -0,0 +1,141 @@
+ ---
+ license: cc-by-nc-4.0
+ library_name: fasttext
+ tags:
+ - text-classification
+ - language-identification
+ ---
+
+ # fastText (Language Identification)
+
+ fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. It was introduced in [this paper](https://arxiv.org/abs/1607.04606). The official website can be found [here](https://fasttext.cc/).
+
+ This LID (Language IDentification) model predicts the language of the input text. The hosted version (`lid218e`) was [released as part of the NLLB project](https://github.com/facebookresearch/fairseq/blob/nllb/README.md#lid-model) and can detect 217 languages. Older versions, which can identify 157 languages, are available on the [official fastText website](https://fasttext.cc/docs/en/language-identification.html).
+
+ ## Model description
+
+ fastText is a library for efficient learning of word representations and sentence classification. fastText is designed to be simple to use for developers, domain experts, and students. It's dedicated to text classification and learning word representations, and was designed to allow for quick model iteration and refinement without specialized hardware. fastText models can be trained on more than a billion words on any multicore CPU within a few minutes.
+
+ It includes pre-trained models, learned on Wikipedia, for 157 languages. fastText can be used as a command-line tool, linked to a C++ application, or used as a library for use cases from experimentation and prototyping to production.
+
+ ## Intended uses & limitations
+
+ You can use pre-trained word vectors for text classification or language identification. See the [tutorials](https://fasttext.cc/docs/en/supervised-tutorial.html) and [resources](https://fasttext.cc/docs/en/english-vectors.html) on its official website to look for tasks that interest you.
+
+ ### How to use
+
+ Here is how to use this model to detect the language of a given text:
+
+ ```python
+ >>> import fasttext
+ >>> from huggingface_hub import hf_hub_download
+
+ >>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
+ >>> model = fasttext.load_model(model_path)
+ >>> model.predict("Hello, world!")
+
+ (('__label__eng_Latn',), array([0.81148803]))
+
+ >>> model.predict("Hello, world!", k=5)
+
+ (('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'),
+  array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
+ ```
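
The labels returned by `predict` follow the pattern `__label__<language>_<script>`, as in the outputs above. A minimal sketch for turning such a result into plain (language, script, probability) tuples follows. `parse_prediction` is a hypothetical helper, not part of fastText, and the sample values are copied from the output shown above rather than produced by running the model.

```python
LABEL_PREFIX = "__label__"


def parse_prediction(labels, probabilities):
    """Convert fastText-style labels into (language, script, probability) tuples."""
    results = []
    for label, probability in zip(labels, probabilities):
        # Drop the "__label__" prefix when present.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        # Codes such as "eng_Latn" pair a language code with a script code.
        language, _, script = code.partition("_")
        results.append((language, script, float(probability)))
    return results


# Sample values copied from the k=5 example above (first two entries only).
labels = ("__label__eng_Latn", "__label__vie_Latn")
probabilities = [0.61224753, 0.21323682]
print(parse_prediction(labels, probabilities))
```

Keeping the parsing separate from the model call makes it easy to apply a probability threshold afterwards, for example to discard low-confidence predictions on very short inputs.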
+
+ ### Limitations and bias
+
+ Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
+
+ Cosine similarity can be used to measure the similarity between two different word vectors. If two vectors are identical, the cosine similarity will be 1. For two completely unrelated vectors, the value will be 0. If two vectors have an opposite relationship, the value will be -1.
+
+ ```python
+ >>> import numpy as np
+
+ >>> def cosine_similarity(word1, word2):
+ ...     return np.dot(model[word1], model[word2]) / (np.linalg.norm(model[word1]) * np.linalg.norm(model[word2]))
+
+ >>> cosine_similarity("man", "boy")
+
+ 0.061653383
+
+ >>> cosine_similarity("man", "ceo")
+
+ 0.11989131
+
+ >>> cosine_similarity("woman", "ceo")
+
+ -0.08834904
+ ```
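
The three boundary cases described above (1 for identical directions, 0 for orthogonal vectors, -1 for opposite directions) can be checked on toy vectors without loading the model. This is a small sketch using only the standard library. The vectors are made up for illustration and `cosine` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([3.0, 4.0], [6.0, 8.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine([3.0, 4.0], [-3.0, -4.0]))  # opposite direction
```

Note that the measure depends only on direction: scaling either vector by a positive constant leaves the result unchanged.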
+
+ ## Training data
+
+ Pre-trained word vectors for 157 languages were trained on [Common Crawl](http://commoncrawl.org/) and [Wikipedia](https://www.wikipedia.org/) using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish.
+
+ ## Training procedure
+
+ ### Tokenization
+
+ We used the [Stanford word segmenter](https://nlp.stanford.edu/software/segmenter.html) for Chinese, [Mecab](http://taku910.github.io/mecab/) for Japanese and [UETsegmenter](https://github.com/phongnt570/UETsegmenter) for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the [Europarl](https://www.statmt.org/europarl/) preprocessing tools. For the remaining languages, we used the ICU tokenizer.
+
+ More information about the training of these models can be found in the article [Learning Word Vectors for 157 Languages](https://arxiv.org/abs/1802.06893).
+
+ ### License
+
+ The language identification model is distributed under the [*Creative Commons Attribution-NonCommercial 4.0 International Public License*](https://creativecommons.org/licenses/by-nc/4.0/).
+
+ ### Evaluation datasets
+
+ The analogy evaluation datasets described in the paper are available here: [French](https://dl.fbaipublicfiles.com/fasttext/word-analogies/questions-words-fr.txt), [Hindi](https://dl.fbaipublicfiles.com/fasttext/word-analogies/questions-words-hi.txt), [Polish](https://dl.fbaipublicfiles.com/fasttext/word-analogies/questions-words-pl.txt).
+
+ ### BibTeX entry and citation info
+
+ Please cite [1] if using this code for learning word representations or [2] if using for text classification.
+
+ [1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
+
+ ```markup
+ @article{bojanowski2016enriching,
+   title={Enriching Word Vectors with Subword Information},
+   author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
+   journal={arXiv preprint arXiv:1607.04606},
+   year={2016}
+ }
+ ```
+
+ [2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
+
+ ```markup
+ @article{joulin2016bag,
+   title={Bag of Tricks for Efficient Text Classification},
+   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
+   journal={arXiv preprint arXiv:1607.01759},
+   year={2016}
+ }
+ ```
+
+ [3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)
+
+ ```markup
+ @article{joulin2016fasttext,
+   title={FastText.zip: Compressing text classification models},
+   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
+   journal={arXiv preprint arXiv:1612.03651},
+   year={2016}
+ }
+ ```
+
+ If you use these word vectors, please cite the following paper:
+
+ [4] E. Grave\*, P. Bojanowski\*, P. Gupta, A. Joulin, T. Mikolov, [*Learning Word Vectors for 157 Languages*](https://arxiv.org/abs/1802.06893)
+
+ ```markup
+ @inproceedings{grave2018learning,
+   title={Learning Word Vectors for 157 Languages},
+   author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
+   booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
+   year={2018}
+ }
+ ```
+
+ (\* These authors contributed equally.)
+
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8ded5749a2ad79ae9ab7c9190c7c8b97ff20d54ad8b9527ffa50107238fc7f6a
+ size 1176355829
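
The three lines above are a Git LFS pointer: the repository stores this small text file, while the binary itself (about 1.18 GB according to the `size` field) lives in LFS storage. As a hedged sketch, a downloaded file can be compared against the pointer's `oid` and `size`. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local path in the usage comment is a placeholder.

```python
import hashlib
import os

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8ded5749a2ad79ae9ab7c9190c7c8b97ff20d54ad8b9527ffa50107238fc7f6a
size 1176355829
"""


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and hash."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algorithm"])
    # Read in chunks so a large file is never loaded into memory at once.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])
# Usage (placeholder path): matches_pointer("/path/to/model.bin", pointer)
```

Checking the size first is a cheap way to reject truncated downloads before hashing the whole file.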