---
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
---
# Cascaded English Speech2Text Translation
This is a pipeline for speech-to-text translation from English speech to text in any target language, based on a cascaded approach that consists of ASR followed by machine translation.
The pipeline employs [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) for ASR (English speech -> English text)
and [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) for text translation.
The input must be English speech, while the output can be in any of the languages NLLB was trained on. All available languages and their language codes are listed
[here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).
**Model for Japanese speech translation is available at [ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation).**
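For illustration, the two stages of the cascade can be reproduced with plain `transformers` pipelines. This is a minimal sketch only; the actual pipeline in this repository wraps both stages behind a single `pipeline` call, as shown in the Usage section below (the audio file name assumes the sample downloaded there).
```python3
from transformers import pipeline

# Stage 1: ASR (English speech -> English text)
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v3")

# Stage 2: text translation (English text -> Japanese text)
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="jpn_Jpan",
)

english_text = asr("sample_en.wav")["text"]
japanese_text = translator(english_text)[0]["translation_text"]
print(japanese_text)
```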
## Benchmark
The following table shows the character error rate (CER) computed between reference and predicted translations on the English-speech-to-Japanese-text task
(subsets of [CoVoST2 and Fleurs](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation)) for different sizes of NLLB, alongside OpenAI Whisper models.
| model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) | [Fleurs (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------:|
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 62.4 | 63.5 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 64.4 | 67.2 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 62.4 | 62.9 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 63.4 | 66.2 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 178.9 | 209.5 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 179.6 | 201.8 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 178.7 | 201.8 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 178.7 | 202 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 178.9 | 206.8 |
| [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 179.5 | 214.2 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 185.2 | 200.5 |
See [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper) for details of the evaluation.
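As a reference for the metric, CER between a reference and a predicted translation can be computed with the `evaluate` library. This is a minimal sketch with made-up strings, and it assumes the table above reports CER scaled by 100; the exact evaluation setup is in the repository linked above.
```python3
import evaluate

# character error rate between predicted and reference translations
cer = evaluate.load("cer")
score = cer.compute(
    predictions=["こんにちは、世界。"],  # hypothetical model output
    references=["こんにちは世界。"],    # hypothetical reference translation
)
print(100 * score)  # x100 to match the scale used in the table above
```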
### Inference Speed
Because of its cascaded design, the pipeline incurs extra computation compared to a single end-to-end OpenAI Whisper model, in exchange for higher accuracy.
The following table shows the mean inference time in seconds, averaged over 10 trials, for audio samples of different durations.
| model | 10s | 30s | 60s | 300s |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|------:|------:|------:|
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 0.173 | 0.247 | 0.352 | 1.772 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 0.173 | 0.24 | 0.348 | 1.515 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.17 | 0.245 | 0.348 | 1.882 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 0.108 | 0.179 | 0.283 | 1.33 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 0.061 | 0.184 | 0.372 | 1.804 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 0.062 | 0.199 | 0.415 | 1.854 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 0.062 | 0.183 | 0.363 | 1.899 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 0.045 | 0.132 | 0.266 | 1.368 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 0.135 | 0.376 | 0.631 | 3.495 |
| [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 0.054 | 0.108 | 0.231 | 1.019 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 0.045 | 0.124 | 0.208 | 0.838 |
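These numbers can be approximated with a simple wall-clock loop over the pipeline, using the sample file from the Usage section below. This is a minimal sketch; absolute timings depend on hardware, warm-up, and generation settings.
```python3
import time
from transformers import pipeline

pipe = pipeline(
    model="japanese-asr/en-cascaded-s2t-translation",
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="jpn_Jpan",
    chunk_length_s=15,
    trust_remote_code=True,
)

times = []
for _ in range(10):  # 10 trials, as in the table above
    start = time.perf_counter()
    pipe("sample_en.wav")
    times.append(time.perf_counter() - start)
print(f"mean inference time: {sum(times) / len(times):.3f}s")
```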
## Usage
Here is an example of translating English speech into Japanese text.
First, download a sample speech file.
```bash
wget https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval/resolve/main/sample.wav -O sample_en.wav
```
Then, run the pipeline as follows.
```python3
from transformers import pipeline
# load model
pipe = pipeline(
model="japanese-asr/en-cascaded-s2t-translation",
model_translation="facebook/nllb-200-distilled-600M",
tgt_lang="jpn_Jpan",
model_kwargs={"attn_implementation": "sdpa"},
chunk_length_s=15,
trust_remote_code=True,
)
# translate the downloaded English speech into Japanese text
output = pipe("./sample_en.wav")
```
Other NLLB models can be used by setting `model_translation`, for example:
- [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
- [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)
- [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)
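Since the target language is controlled by `tgt_lang`, the same pipeline can translate into any other NLLB language. For example, the FLORES-200 code `fra_Latn` produces French output; the sketch below mirrors the Japanese example above with only the language code changed.
```python3
from transformers import pipeline

# same pipeline as above, but targeting French instead of Japanese
pipe = pipeline(
    model="japanese-asr/en-cascaded-s2t-translation",
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="fra_Latn",  # FLORES-200 language code for French
    chunk_length_s=15,
    trust_remote_code=True,
)
output = pipe("./sample_en.wav")
```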