---
library_name: transformers
license: apache-2.0
language:
- th
pipeline_tag: automatic-speech-recognition
---

# Monsoon-Whisper-Medium-Gigaspeech2

**Monsoon-Whisper-Medium-GigaSpeech2** is a 🇹🇭 Thai *Automatic Speech Recognition* (ASR) model. It is based on [Whisper-Medium](https://huggingface.co/openai/whisper-medium) and fine-tuned on [GigaSpeech2](https://huggingface.co/datasets/speechcolab/gigaspeech2).

It was originally developed as a scaling experiment for research on emergent capabilities in ASR tasks. It performs well on in-the-wild audio, including recordings sourced from YouTube and noisy environments.

More details can be found in our [Typhoon-Audio Release Blog](https://blog.opentyphoon.ai/typhoon-audio-preview-release-6fbb3f938287).

## Model Description

- **Model type**: Whisper Medium
- **Requirement**: transformers 4.38.0 or newer
- **Primary Language(s)**: Thai 🇹🇭
- **License**: Apache 2.0

## Usage Example

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_path = "scb10x/monsoon-whisper-medium-gigaspeech2"
device = "cuda"
filepath = "audio.wav"

processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)
model.to(device)
model.eval()

# Force Thai transcription
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="th", task="transcribe"
)

# Load the audio, downmix to mono, and resample to the 16 kHz rate expected by Whisper
array, sr = torchaudio.load(filepath)
array = array.mean(dim=0)
if sr != 16000:
    array = torchaudio.functional.resample(array, sr, 16000)
    sr = 16000

input_features = (
    processor(array.numpy(), sampling_rate=sr, return_tensors="pt")
    .to(device)
    .to(torch.bfloat16)
    .input_features
)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```

## Evaluation Results

| Model                                   | WER (GS2) | WER (CV17) | CER (GS2) | CER (CV17) |
|:----------------------------------------|:---------:|:----------:|:---------:|:----------:|
| whisper-large-v3                        | 37.02     | 22.63      | 24.03     | 8.49       |
| whisper-medium                          | 55.64     | 43.01      | 37.55     | 16.41      |
| biodatlab-whisper-th-medium-combined    | 31.00     | 14.25      | 21.20     | 5.69       |
| biodatlab-whisper-th-large-v3-combined  | 29.02     | 15.72      | 19.96     | 6.32       |
| monsoon-whisper-medium-gigaspeech2      | 22.74     | 20.79      | 14.15     | 6.92       |

GS2 = GigaSpeech2; CV17 = Common Voice 17 (Thai test sets). A minimal sketch of how WER/CER can be computed is included at the end of this card.

## Intended Uses & Limitations

This model is experimental and may not always produce accurate transcriptions. Developers should carefully assess potential risks in the context of their specific applications.

## Follow us & Support

- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

## Typhoon Team

Kunat Pipatanakul, Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai
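
## Computing WER/CER

WER and CER compare a model transcription against a reference transcript at the word and character level. The snippet below is a minimal sketch, not the exact pipeline used for the table above: it assumes the Hugging Face `evaluate` package and PyThaiNLP's `newmm` word tokenizer, and Thai WER in particular depends heavily on which word segmenter is used.

```python
# Minimal sketch for computing WER/CER on Thai text.
# Assumptions: the `evaluate` (with `jiwer`) and `pythainlp` packages are installed;
# the newmm tokenizer is an illustrative choice and may differ from the setup
# behind the numbers reported in the Evaluation Results table.
import evaluate
from pythainlp.tokenize import word_tokenize

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")


def segment(text: str) -> str:
    # Thai is written without spaces between words, so insert spaces with a
    # word tokenizer before computing a word-level error rate.
    return " ".join(word_tokenize(text, engine="newmm", keep_whitespace=False))


# Illustrative example pair; in practice, use your test-set transcripts and
# the model outputs from the usage example above.
references = ["สวัสดีครับ วันนี้อากาศดี"]
predictions = ["สวัสดีครับ วันนี้อากาศดีมาก"]

wer = wer_metric.compute(
    predictions=[segment(p) for p in predictions],
    references=[segment(r) for r in references],
)
cer = cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}, CER: {cer:.4f}")
```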