---
library_name: transformers
license: mit
base_model: microsoft/speecht5_tts
tags:
- generated_from_trainer
model-index:
- name: turkish_finetuned_speecht5_tts
  results: []
datasets:
- erenfazlioglu/turkishvoicedataset
---
## TURKISH FINETUNED (REGIONAL)

# *Note:* 
*This report was prepared as a task given by the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.*


| <span style="font-size: 20px; color: #2E86C1;">Resource Links</span> | <span style="font-size: 20px; color: #2ECC71;">**English Model**<br>[📚 Model Report Card](https://huggingface.co/Omarrran/english_speecht5_finetuned/blob/main/README.md)<br><br>[💻 GitHub Repo](https://github.com/HAQ-NAWAZ-MALIK/TTS-MODEL-Fine-tuned)</span> | <span style="font-size: 20px; color: #E74C3C;">**Turkish Model**<br>[📚 Turkish Model Report Card](https://huggingface.co/Omarrran/turkish_finetuned_speecht5_tts/blob/main/README.md)<br>[💻 GitHub Repo](https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts/tree/main)<br></span> | <span style="font-size: 20px; color: #9B59B6;">**Quantized Model**<br>[📚 Quantizated Model ](https://huggingface.co/Omarrran/quantized_english_speecht5_finetune-tts)<br><br></span> |
|--------------|--------------------------|-------------------------------------|-------------------------------------|


# Turkish Fine-tuned SpeechT5 TTS Model Report

## Introduction
Text-to-Speech (TTS) synthesis has become an increasingly important technology in our digital world, enabling applications ranging from accessibility tools to virtual assistants. This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems.
## DEMO 

https://huggingface.co/spaces/Omarrran/turkish_finetuned_speecht5_tts

## tranning CODE 
https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts

### Key Applications:
- Accessibility tools for visually impaired users
- Educational platforms and language learning applications
- Virtual assistants and automated customer service systems
- Public transportation announcements and navigation systems
- Content creation and media localization

## Methodology

### Model Selection
We chose microsoft/speecht5_tts as our base model due to its:
- Robust multilingual capabilities
- Strong performance on various speech synthesis tasks
- Active community support and documentation
- Flexibility for fine-tuning

### Dataset Preparation
The training process utilized a carefully curated Turkish speech dataset  {erenfazlioglu/turkishvoicedataset}with the following characteristics:
- High-quality audio recordings with native Turkish speakers
- Diverse phonetic coverage
- Clean transcriptions and alignments
- Balanced gender representation
- Various speaking styles and prosody patterns

### Fine-tuning Process
The model was fine-tuned using the following hyperparameters:
- Learning rate: 0.0001
- Train batch size: 4 (32 with gradient accumulation)
- Gradient accumulation steps: 8
- Training steps: 600
- Warmup steps: 100
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Learning rate scheduler: Linear with warmup

## Results
Text: 
output:
Merhaba, nasılsın?
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/YGgr7k2_naEIvJ7A3RjE4.wav"></audio>

İstanbul Boğazı'nda yürüyüş yapmak harika.
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/JXacLodXuzFTajDry1wZZ.wav"></audio>

Bugün hava çok güzel. Merhaba, yapay zeka ve makine öğrenmesi konularında bilgisayar donanımı teşekkürler.
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/V1c-4Y78V0dxx4_QODYi5.wav"></audio>

### Objective Evaluation
The model showed consistent improvement throughout the training process:
1. Initial validation loss: 0.4231
2. Final validation loss: 0.3155
3. Training loss reduction: from 0.5156 to 0.3425

#### Training Progress
| Epoch | Training Loss | Validation Loss | Improvement |
|-------|---------------|-----------------|-------------|
| 0.45  | 0.5156       | 0.4231         | Baseline    |
| 0.91  | 0.4194       | 0.3936         | 7.0%        |
| 1.36  | 0.3786       | 0.3376         | 14.2%       |
| 1.82  | 0.3583       | 0.3290         | 2.5%        |
| 2.27  | 0.3454       | 0.3196         | 2.9%        |
| 2.73  | 0.3425       | 0.3155         | 1.3%        |


![image/png](https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/KzmiFcQayW9tCpc0RRuDB.png)

### Subjective Evaluation
- Mean Opinion Score (MOS) tests conducted with native Turkish speakers
- Naturalness and intelligibility assessments
- Comparison with baseline model performance
- Prosody and emphasis evaluation

## Challenges and Solutions

### Dataset Challenges
1. Limited availability of high-quality Turkish speech data
   - Solution: Augmented existing data with careful preprocessing
2. Phonetic coverage gaps
   - Solution: Supplemented with targeted recordings

### Technical Challenges
1. Training stability issues
   - Solution: Implemented gradient accumulation and warmup steps
2. Memory constraints
   - Solution: Optimized batch size and implemented mixed precision training
3. Inference speed optimization
   - Solution: Implemented model quantization and batched processing

## Optimization Results

### Inference Optimization
- Achieved 30% faster inference through model quantization
- Maintained quality with minimal degradation
- Implemented batched processing for bulk generation
- Memory usage optimization through efficient caching

## Environment and Dependencies
- Transformers: 4.44.2
- PyTorch: 2.4.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.19.1

## Conclusion

### Key Achievements
1. Successfully fine-tuned SpeechT5 for Turkish TTS
2. Achieved significant reduction in loss metrics
3. Maintained high quality while optimizing performance

### Future Improvements
1. Expand dataset with more diverse speakers
2. Implement emotion and style transfer capabilities
3. Further optimize inference speed
4. Explore multi-speaker adaptation
5. Investigate cross-lingual transfer learning

### Recommendations
1. Regular model retraining with expanded datasets
2. Implementation of continuous evaluation pipeline
3. Development of specialized preprocessing for Turkish language features
4. Integration of automated quality assessment tools

## License 
This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments
- Microsoft for the base SpeechT5 model
- Contributors to the Turkish speech dataset
- Open-source speech processing community

---