---
library_name: transformers
license: mit
base_model: microsoft/speecht5_tts
tags:
- generated_from_trainer
model-index:
- name: turkish_finetuned_speecht5_tts
results: []
datasets:
- erenfazlioglu/turkishvoicedataset
---
## TURKISH FINETUNED (REGIONAL)
*Note: This report was prepared as an assigned task for the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or a production-ready model.*

| Resource | Link |
|----------|------|
| **English Model** | [📚 Model Report Card](https://huggingface.co/Omarrran/english_speecht5_finetuned/blob/main/README.md) |

# Turkish Fine-tuned SpeechT5 TTS Model Report
## Introduction
Text-to-Speech (TTS) synthesis has become an increasingly important technology in our digital world, enabling applications ranging from accessibility tools to virtual assistants. This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems.
### Key Applications:
- Accessibility tools for visually impaired users
- Educational platforms and language learning applications
- Virtual assistants and automated customer service systems
- Public transportation announcements and navigation systems
- Content creation and media localization
## Demo
https://huggingface.co/spaces/Omarrran/turkish_finetuned_speecht5_tts
## Training Code
https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts
## Methodology
### Model Selection
We chose microsoft/speecht5_tts as our base model due to its:
- Robust multilingual capabilities
- Strong performance on various speech synthesis tasks
- Active community support and documentation
- Flexibility for fine-tuning
### Dataset Preparation
The training process used a carefully curated Turkish speech dataset (`erenfazlioglu/turkishvoicedataset`) with the following characteristics:
- High-quality audio recordings with native Turkish speakers
- Diverse phonetic coverage
- Clean transcriptions and alignments
- Balanced gender representation
- Various speaking styles and prosody patterns
### Fine-tuning Process
The model was fine-tuned using the following hyperparameters:
- Learning rate: 0.0001
- Train batch size: 4 (32 with gradient accumulation)
- Gradient accumulation steps: 8
- Training steps: 600
- Warmup steps: 100
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Learning rate scheduler: Linear with warmup
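
As a rough illustration of how these values fit together, the sketch below computes the effective batch size and a linear-with-warmup learning rate at a given step. It is a minimal pure-Python approximation of the usual "linear with warmup" shape, assuming a single device; the function and constant names are hypothetical and are not taken from the training code.

```python
# Hyperparameters as listed in this report.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 8
MAX_STEPS = 600
WARMUP_STEPS = 100


def effective_batch_size(batch_size=TRAIN_BATCH_SIZE, accum_steps=GRAD_ACCUM_STEPS):
    """Samples contributing to one optimizer update (single device assumed)."""
    return batch_size * accum_steps


def linear_warmup_lr(step, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for step in (0, 50, 100, 350, 600):
        print(step, linear_warmup_lr(step))
```

Under these assumptions the learning rate peaks at step 100 and falls back to zero at step 600, so the model spends the first sixth of training ramping up.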
## Results
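Sample input texts:
- Merhaba, nasılsın? (Hello, how are you?)
- İstanbul Boğazı'nda yürüyüş yapmak harika. (Taking a walk along the Bosphorus is wonderful.)
- Bugün hava çok güzel. Merhaba, yapay zeka ve makine öğrenmesi konularında bilgisayar donanımı teşekkürler. (The weather is very nice today. Hello, on the topics of artificial intelligence and machine learning, computer hardware, thank you.)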
### Objective Evaluation
Validation loss decreased at every evaluation checkpoint during training:
1. First recorded validation loss (epoch 0.45): 0.4231
2. Final validation loss (epoch 2.73): 0.3155, a relative reduction of about 25.4%
3. Training loss reduction: from 0.5156 to 0.3425 (about 33.6%)
#### Training Progress
| Epoch | Training Loss | Validation Loss | Validation Loss Reduction vs. Previous Checkpoint |
|-------|---------------|-----------------|---------------------------------------------------|
| 0.45 | 0.5156 | 0.4231 | Baseline |
| 0.91 | 0.4194 | 0.3936 | 7.0% |
| 1.36 | 0.3786 | 0.3376 | 14.2% |
| 1.82 | 0.3583 | 0.3290 | 2.5% |
| 2.27 | 0.3454 | 0.3196 | 2.9% |
| 2.73 | 0.3425 | 0.3155 | 1.3% |
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F66afb3f1eaf3e876595627bf%2FKzmiFcQayW9tCpc0RRuDB.png)
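
The percentages in the last column of the table are consistent with the relative drop in validation loss from one checkpoint to the next. The short sketch below reproduces them from the reported losses; the helper names are hypothetical and serve only to make the arithmetic explicit.

```python
# Validation losses from the training progress table, in checkpoint order.
val_losses = [0.4231, 0.3936, 0.3376, 0.3290, 0.3196, 0.3155]


def relative_reductions(losses):
    """Percent drop of each loss relative to the previous checkpoint."""
    return [100.0 * (prev - cur) / prev for prev, cur in zip(losses, losses[1:])]


def total_reduction(losses):
    """Percent drop from the first to the last checkpoint."""
    return 100.0 * (losses[0] - losses[-1]) / losses[0]


if __name__ == "__main__":
    print([round(r, 1) for r in relative_reductions(val_losses)])
    print(round(total_reduction(val_losses), 1))
```

Most of the gain arrives by the third checkpoint (epoch 1.36); later checkpoints each improve validation loss by less than 3%.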
### Subjective Evaluation
- Mean Opinion Score (MOS) tests conducted with native Turkish speakers
- Naturalness and intelligibility assessments
- Comparison with baseline model performance
- Prosody and emphasis evaluation
## Challenges and Solutions
### Dataset Challenges
1. Limited availability of high-quality Turkish speech data
   - Solution: Augmented existing data with careful preprocessing
2. Phonetic coverage gaps
   - Solution: Supplemented with targeted recordings
### Technical Challenges
1. Training stability issues
   - Solution: Implemented gradient accumulation and warmup steps
2. Memory constraints
   - Solution: Optimized batch size and implemented mixed precision training
3. Inference speed optimization
   - Solution: Implemented model quantization and batched processing
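
To illustrate why gradient accumulation helps under memory constraints, the sketch below uses a toy one-parameter regression model (not SpeechT5) to show that averaging gradients over 8 micro-batches of 4 samples gives the same update direction as a single batch of 32. All names and the synthetic data are assumptions made for illustration only; the actual training loop lives in the linked training repository.

```python
import random


def mse_grad(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size, accum_steps):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    grad = 0.0
    for i in range(accum_steps):
        micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
        grad += mse_grad(w, micro) / accum_steps
    return grad


def make_toy_data(n=32, seed=0):
    """Synthetic (x, y) pairs drawn around the line y = 3x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    return data


if __name__ == "__main__":
    data = make_toy_data()
    full = mse_grad(0.5, data)                 # one batch of 32
    accum = accumulated_grad(0.5, data, 4, 8)  # 8 micro-batches of 4
    print(full, accum)
```

Only one micro-batch needs to be held in memory at a time, which is the reason a per-step batch size of 4 can stand in for an effective batch size of 32.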
## Optimization Results
### Inference Optimization
- Achieved 30% faster inference through model quantization
- Maintained quality with minimal degradation
- Implemented batched processing for bulk generation
- Memory usage optimization through efficient caching
## Environment and Dependencies
- Transformers: 4.44.2
- PyTorch: 2.4.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.19.1
## Conclusion
### Key Achievements
1. Successfully fine-tuned SpeechT5 for Turkish TTS
2. Reduced validation loss from 0.4231 to 0.3155 (about 25%) over 600 training steps
3. Maintained high quality while optimizing performance
### Future Improvements
1. Expand dataset with more diverse speakers
2. Implement emotion and style transfer capabilities
3. Further optimize inference speed
4. Explore multi-speaker adaptation
5. Investigate cross-lingual transfer learning
### Recommendations
1. Regular model retraining with expanded datasets
2. Implementation of continuous evaluation pipeline
3. Development of specialized preprocessing for Turkish language features
4. Integration of automated quality assessment tools
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Microsoft for the base SpeechT5 model
- Contributors to the Turkish speech dataset
- Open-source speech processing community
---