---
license: mit
language:
- kbd
datasets:
- anzorq/kbd_speech
- anzorq/sixuxar_yijiri_mak7
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Circassian (Kabardian) ASR Model

This is a fine-tuned Automatic Speech Recognition (ASR) model for Kabardian (`kbd`), based on `facebook/w2v-bert-2.0`.

The model was trained on a combination of the `anzorq/kbd_speech` dataset (filtered on `country=russia`) and the `anzorq/sixuxar_yijiri_mak7` dataset.

## Model Details

- **Base Model**: facebook/w2v-bert-2.0
- **Language**: Kabardian
- **Task**: Automatic Speech Recognition (ASR)
- **Datasets**: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
- **Training Steps**: 4000

## Training

The model was fine-tuned using the following training arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='output',
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=1000,
    eval_steps=500,
    logging_steps=300,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    report_to="wandb"
)
```
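
As a rough reading of these arguments, the effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. The sketch below assumes a single GPU, since the hardware setup is not stated here; `num_devices` and the other variable names are illustrative only.

```python
# Values taken from the training arguments above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2

# Assumption: a single device; the actual hardware setup is not documented.
num_devices = 1

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)

# Examples consumed by step 4000 under the same assumption,
# counting one optimizer update per training step.
examples_seen = 4000 * effective_batch_size
print(examples_seen)
```

With more than one device, both numbers scale linearly with `num_devices`.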

## Performance

Training and validation metrics, evaluated every 500 steps. The released checkpoint is from step 4000, which has the lowest WER in the table:

| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|----------|
| 500 | 2.761100 | 0.572304 | 0.830552 |
| 1000 | 0.325700 | 0.352516 | 0.678261 |
| 1500 | 0.247000 | 0.271146 | 0.377438 |
| 2000 | 0.179300 | 0.235156 | 0.319859 |
| 2500 | 0.176100 | 0.229383 | 0.293537 |
| 3000 | 0.171600 | 0.208033 | 0.310458 |
| 3500 | 0.133200 | 0.199517 | 0.289542 |
| **4000** (this model) | **0.117900** | **0.208304** | **0.258989** |
| 4500 | 0.145400 | 0.184942 | 0.285311 |
| 5000 | 0.129600 | 0.195167 | 0.372033 |
| 5500 | 0.122600 | 0.203584 | 0.386369 |
| 6000 | 0.196800 | 0.270521 | 0.687662 |
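
For reference, WER (word error rate) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words, so 0.258989 corresponds to roughly 26 errors per 100 reference words. The sketch below is a minimal pure-Python illustration of that definition. It is not the evaluation code used for this model, and the function name `word_error_rate` and the example strings are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of iteration i, prev[j] is the edit distance between
    # the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings, not real model output:
# one substitution and one deletion over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.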

## Note

To optimize training and reduce the tokenizer vocabulary size, the following letter sequences in the training data were replaced before training. Most are digraphs mapped to a single character; `кхъ` maps to two characters, and `я` is expanded to `йа`:

```
гъ -> ɣ
дж -> j
дз -> ӡ
жь -> ʐ
кӏ -> қ
къ -> q
кхъ -> qҳ
лъ -> ɬ
лӏ -> ԯ
пӏ -> ԥ
тӏ -> ҭ
фӏ -> ჶ
хь -> h
хъ -> ҳ
цӏ -> ҵ
щӏ -> ɕ
я -> йа
```

The model therefore outputs text in this substituted alphabet. After obtaining a transcription, apply the replacements in reverse to restore the original characters.
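
The forward substitution can be sketched as follows. This is a minimal illustration that assumes the replacements are applied one after another in the order listed above using plain `str.replace`; the original preprocessing script is not shown here, and the names `replacements`, `replace_symbols`, and `restore_symbols` are hypothetical. The reverse step handles the two-character substitutes first, so that `qҳ` is restored to `кхъ` before a lone `q` is rewritten.

```python
# Mapping copied from the listing above; dict order matches the listing.
replacements = {
    'гъ': 'ɣ', 'дж': 'j', 'дз': 'ӡ', 'жь': 'ʐ',
    'кӏ': 'қ', 'къ': 'q', 'кхъ': 'qҳ', 'лъ': 'ɬ',
    'лӏ': 'ԯ', 'пӏ': 'ԥ', 'тӏ': 'ҭ', 'фӏ': 'ჶ',
    'хь': 'h', 'хъ': 'ҳ', 'цӏ': 'ҵ', 'щӏ': 'ɕ',
    'я': 'йа',
}


def replace_symbols(text):
    """Apply the forward substitutions in listing order (assumed order)."""
    for original, substitute in replacements.items():
        text = text.replace(original, substitute)
    return text


def restore_symbols(text):
    """Undo the substitutions, longest substitute first."""
    ordered = sorted(replacements.items(), key=lambda kv: len(kv[1]), reverse=True)
    for original, substitute in ordered:
        text = text.replace(substitute, original)
    return text


# Short sample strings used only to exercise the mapping.
for sample in ["кхъуэ", "къэбэрдей", "япэ", "щӏалэ"]:
    encoded = replace_symbols(sample)
    print(sample, "->", encoded, "->", restore_symbols(encoded))
```

The round trip is not lossless in general: source text that already contains `йа`, or the Latin letters `j`, `q`, and `h`, would be rewritten by the reverse step.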

## Inference

```python
import torchaudio
from transformers import pipeline

# device=0 selects the first GPU; use device=-1 to run on CPU.
pipe = pipeline(model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

# Maps the substituted characters in the model output back to the original letters.
reversed_replacements = {
    'ɣ': 'гъ', 'j': 'дж', 'ӡ': 'дз', 'ʐ': 'жь',
    'қ': 'кӏ', 'q': 'къ', 'qҳ': 'кхъ', 'ɬ': 'лъ',
    'ԯ': 'лӏ', 'ԥ': 'пӏ', 'ҭ': 'тӏ', 'ჶ': 'фӏ',
    'h': 'хь', 'ҳ': 'хъ', 'ҵ': 'цӏ', 'ɕ': 'щӏ',
    'йа': 'я'
}

def reverse_replace_symbols(text):
    # Replace longer keys first so that 'qҳ' -> 'кхъ' runs before 'q' -> 'къ';
    # otherwise 'qҳ' would be restored as 'къхъ'.
    for substitute in sorted(reversed_replacements, key=len, reverse=True):
        text = text.replace(substitute, reversed_replacements[substitute])
    return text

def transcribe_speech(audio_path):
    # Resample the audio to 16 kHz and write it to a temporary file.
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    torchaudio.save("temp.wav", waveform, 16000)
    # Transcribe in 10-second chunks, then restore the original characters.
    transcription = pipe("temp.wav", chunk_length_s=10)['text']
    transcription = reverse_replace_symbols(transcription)
    return transcription

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")
```