|
------------------------------------------------------------------------------- |
|
DeepSpeech Scorer for Icelandic 22.06 |
|
------------------------------------------------------------------------------- |
|
|
|
Authors : Carlos Daniel Hernández Mena ([email protected]). |
|
|
|
Language : Icelandic. |
|
|
|
Recommended use : speech recognition. |
|
|
|
------------------------------------------------------------------------------- |
|
Description |
|
------------------------------------------------------------------------------- |
|
|
|
"DeepSpeech Scorer for Icelandic 22.06" is a scorer suitable for recognizers |
|
based on Mozilla's DeepSpeech recognizer [1]. A "scorer" is a single file |
|
used to perform language modeling. It is composed of two sub-components, a |
|
KenLM language model and a trie data structure containing all words in the |
|
vocabulary [2]. |
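
To illustrate the second sub-component, the following is a minimal Python
sketch of a trie that stores a vocabulary and answers word and prefix
lookups. It is not DeepSpeech's actual implementation; the class name
VocabularyTrie, its methods, and the sample words are hypothetical and
chosen only for illustration.

```python
# Illustrative sketch only: a dictionary-based trie over a small vocabulary.
# All names here are hypothetical and not part of DeepSpeech or KenLM.

_END = ""  # sentinel key marking the end of a complete word


class VocabularyTrie:
    def __init__(self, words=()):
        self._root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert a word, creating one nested node per character."""
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True

    def _walk(self, prefix):
        """Follow the prefix from the root; return the node or None."""
        node = self._root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if the exact word was added to the vocabulary."""
        node = self._walk(word)
        return node is not None and _END in node

    def has_prefix(self, prefix):
        """True if at least one vocabulary word starts with the prefix."""
        return self._walk(prefix) is not None


trie = VocabularyTrie(["hestur", "hestar", "hús"])
```

With this structure, a partial character sequence such as "hest" can be
checked against the vocabulary without scanning every word in it.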
|
|
|
This scorer was originally created to be used with the following DeepSpeech |
|
recipe, developed by the Language and Voice Lab (LVL) at Reykjavík University |
|
in 2022: |
|
|
|
https://github.com/cadia-lvl/samromur-asr/tree/d5_samromur/d5_samromur |
|
|
|
Nevertheless, because resources of this kind are flexible and can be applied |
|
to other tasks, systems, or code recipes, it was decided to publish this |
|
scorer as an independent item. |
|
|
|
------------------------------------------------------------------------------- |
|
The Language Model |
|
------------------------------------------------------------------------------- |
|
|
|
The language model was created using the Icelandic Gigaword Corpus [3]. The |
|
Gigaword corpus contains text from newspaper articles, parliamentary speeches, |
|
adjudications, books, transcribed radio/television news and more. The |
|
normalization process of the sentences used to generate the language |
|
model includes allowing only characters belonging to the Icelandic alphabet, |
|
expanding numbers and abbreviations, and removing punctuation marks [4]. The |
|
resulting text has more than 44 million lines (approximately 5.3 GB), and |
|
it was used to create the scorer. |
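
The character filtering and punctuation removal described above can be
sketched in Python as follows. This is a rough illustration under stated
assumptions, not the script used to build the scorer: the function name
normalize_line is hypothetical, lowercasing is an assumption, and the
expansion of numbers and abbreviations is not implemented, so lines that
still contain digits or non-Icelandic letters are simply discarded here.

```python
import re

# The 32 lowercase letters of the Icelandic alphabet (assumed allowed set).
ICELANDIC_LETTERS = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"

# Matches any character that is neither an Icelandic letter nor whitespace.
_NOT_ALLOWED = re.compile(f"[^{ICELANDIC_LETTERS}\\s]")
# Matches punctuation: anything that is not a word character or whitespace.
_PUNCTUATION = re.compile(r"[^\w\s]")


def normalize_line(line):
    """Return the normalized line, or None if it must be discarded.

    Hypothetical helper: number and abbreviation expansion is NOT done here,
    so a line containing digits or foreign letters yields None.
    """
    text = line.lower()                # assumption: the text is lowercased
    text = _PUNCTUATION.sub(" ", text) # remove punctuation marks
    text = " ".join(text.split())      # collapse repeated whitespace
    if not text or _NOT_ALLOWED.search(text):
        return None
    return text
```

For example, normalize_line("Halló, heimur!") keeps the line with its
punctuation removed, while normalize_line("Árið 2022") returns None because
the number has not been expanded into words.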
|
|
|
------------------------------------------------------------------------------- |
|
Citation |
|
------------------------------------------------------------------------------- |
|
|
|
When publishing results based on this scorer, please cite: |
|
|
|
Mena, Carlos; "DeepSpeech Scorer for Icelandic 22.06". Web Download. |
|
Reykjavik University: Language and Voice Lab, 2022. |
|
|
|
Contact: Carlos Mena ([email protected]) |
|
|
|
License: CC BY 4.0 |
|
|
|
------------------------------------------------------------------------------- |
|
Acknowledgements |
|
------------------------------------------------------------------------------- |
|
|
|
This initiative was funded by the Language Technology Programme for Icelandic |
|
2019-2023. The programme, which is managed and coordinated by Almannarómur, |
|
is funded by the Icelandic Ministry of Education, Science and Culture. |
|
|
|
------------------------------------------------------------------------------- |
|
References |
|
------------------------------------------------------------------------------- |
|
|
|
[1] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, |
|
E., Case, C., ... & Zhu, Z. (2016, June). Deep speech 2: End-to-end |
|
speech recognition in English and Mandarin. In International Conference |
|
on Machine Learning (pp. 173-182). PMLR. |
|
|
|
[2] Mozilla's DeepSpeech online documentation: |
|
https://deepspeech.readthedocs.io/en/r0.9/Scorer.html |
|
|
|
[3] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S., |
|
& Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text |
|
corpus. In Proceedings of the Eleventh International Conference on |
|
Language Resources and Evaluation (LREC 2018). |
|
|
|
[4] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason, |
|
J. (2018, May). Open ASR for Icelandic: Resources and a baseline system. |
|
In Proceedings of the Eleventh International Conference on Language |
|
Resources and Evaluation (LREC 2018). |
|
|
|
------------------------------------------------------------------------------- |
|
------------------------------------------------------------------------------- |
|
|
|
|