shovalmessica
committed on
Update README.md
README.md
CHANGED
@@ -7,7 +7,7 @@ Official implementation of [NAST: Noise Aware Speech Tokenization for Speech Lan
 [![arXiv](https://img.shields.io/badge/arXiv-2406.11037-green.svg)](https://arxiv.org/abs/2406.11037)
 
 <p align="center">
-<img src="
+<img src="finanlfinal1.png" alt="diagram" style="width:55%;height:auto;"/>
 </p>
 
 <b>Abstract:</b> Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can be later used for various downstream tasks including automatic speech recognition, text-to-speech, etc. More relevant to this study, such representation serves as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization under the noisy setup and present NAST: Noise Aware Speech Tokenization for Speech Language Models. NAST is composed of three main components: (i) a predictor; (ii) a residual encoder; and (iii) a decoder. We evaluate the efficiency of NAST considering several speech language modeling tasks, and show that NAST is superior to the evaluated baselines across all setups. Lastly, we analyze NAST and show its disentanglement properties and robustness to signal variations in the form of noise, reverberation, pitch-shift, and time-stretch.