---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.1_ctc
license: cc-by-4.0
metrics:
- cer
- bleu
- accuracy
library_name: espnet
---

[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) project.

Due to time constraints, the model used in the paper was trained for 40 "epochs". A new model trained for 45 "epochs" (approximately three full passes over the data) is also included in this repo to match the setup of the encoder-decoder OWSM; it achieves better performance than the old one on many test sets.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
espnet
espnet_model_zoo
```


**Example usage can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
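A minimal inference sketch is shown below. It is an assumption-laden illustration, not the authoritative recipe: the model ID (`espnet/owsm_ctc_v3.1_1B`), the `Speech2TextGreedySearch` entry point, its keyword arguments, and the audio filename are all assumed here based on typical ESPnet OWSM-CTC usage; please refer to the ESPnet recipe linked above for the official example.

```python
# Hedged sketch of OWSM-CTC inference with ESPnet.
# Assumptions (verify against the ESPnet recipe linked above):
#   - model ID "espnet/owsm_ctc_v3.1_1B"
#   - the Speech2TextGreedySearch API and its keyword arguments
#   - a 16 kHz mono audio file named "speech.wav"
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",   # assumed model ID on the Hugging Face Hub
    device="cuda",               # use "cpu" if no GPU is available
    lang_sym="<eng>",            # language token for English ASR
    task_sym="<asr>",            # task token: speech recognition
)

# Load audio at the model's expected 16 kHz sampling rate.
speech, rate = librosa.load("speech.wav", sr=16000)

# Decode; the first element of the first hypothesis is the text.
result = s2t(speech)
print(result[0][0])
```

Changing `lang_sym` and `task_sym` (e.g. `<st_eng>` for translation into English) selects the other tasks the model was trained on; again, the exact token names should be checked against the ESPnet documentation.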