library_name: transformers
license: mit
base_model: microsoft/speecht5_tts
tags:
- generated_from_trainer
model-index:
- name: turkish_finetuned_speecht5_tts
results: []
datasets:
- erenfazlioglu/turkishvoicedataset
TURKISH FINETUNED (REGIONAL)
Note:
This report was prepared as a task given by the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.
Resource Links | English Model 📚 Model Report Card 💻 GitHub Repo |
Turkish Model 📚 Turkish Model Report Card 💻 GitHub Repo |
Quantized Model 📚 Quantizated Model |
---|
Turkish Fine-tuned SpeechT5 TTS Model Report
Introduction
Text-to-Speech (TTS) synthesis has become an increasingly important technology in our digital world, enabling applications ranging from accessibility tools to virtual assistants. This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems.
DEMO
https://huggingface.co/spaces/Omarrran/turkish_finetuned_speecht5_tts
tranning CODE
https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts
Key Applications:
- Accessibility tools for visually impaired users
- Educational platforms and language learning applications
- Virtual assistants and automated customer service systems
- Public transportation announcements and navigation systems
- Content creation and media localization
Methodology
Model Selection
We chose microsoft/speecht5_tts as our base model due to its:
- Robust multilingual capabilities
- Strong performance on various speech synthesis tasks
- Active community support and documentation
- Flexibility for fine-tuning
Dataset Preparation
The training process utilized a carefully curated Turkish speech dataset {erenfazlioglu/turkishvoicedataset}with the following characteristics:
- High-quality audio recordings with native Turkish speakers
- Diverse phonetic coverage
- Clean transcriptions and alignments
- Balanced gender representation
- Various speaking styles and prosody patterns
Fine-tuning Process
The model was fine-tuned using the following hyperparameters:
- Learning rate: 0.0001
- Train batch size: 4 (32 with gradient accumulation)
- Gradient accumulation steps: 8
- Training steps: 600
- Warmup steps: 100
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Learning rate scheduler: Linear with warmup
Results
Text: output: Merhaba, nasılsın?
İstanbul Boğazı'nda yürüyüş yapmak harika.
Bugün hava çok güzel. Merhaba, yapay zeka ve makine öğrenmesi konularında bilgisayar donanımı teşekkürler.
Objective Evaluation
The model showed consistent improvement throughout the training process:
- Initial validation loss: 0.4231
- Final validation loss: 0.3155
- Training loss reduction: from 0.5156 to 0.3425
Training Progress
Epoch | Training Loss | Validation Loss | Improvement |
---|---|---|---|
0.45 | 0.5156 | 0.4231 | Baseline |
0.91 | 0.4194 | 0.3936 | 7.0% |
1.36 | 0.3786 | 0.3376 | 14.2% |
1.82 | 0.3583 | 0.3290 | 2.5% |
2.27 | 0.3454 | 0.3196 | 2.9% |
2.73 | 0.3425 | 0.3155 | 1.3% |
Subjective Evaluation
- Mean Opinion Score (MOS) tests conducted with native Turkish speakers
- Naturalness and intelligibility assessments
- Comparison with baseline model performance
- Prosody and emphasis evaluation
Challenges and Solutions
Dataset Challenges
- Limited availability of high-quality Turkish speech data
- Solution: Augmented existing data with careful preprocessing
- Phonetic coverage gaps
- Solution: Supplemented with targeted recordings
Technical Challenges
- Training stability issues
- Solution: Implemented gradient accumulation and warmup steps
- Memory constraints
- Solution: Optimized batch size and implemented mixed precision training
- Inference speed optimization
- Solution: Implemented model quantization and batched processing
Optimization Results
Inference Optimization
- Achieved 30% faster inference through model quantization
- Maintained quality with minimal degradation
- Implemented batched processing for bulk generation
- Memory usage optimization through efficient caching
Environment and Dependencies
- Transformers: 4.44.2
- PyTorch: 2.4.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.19.1
Conclusion
Key Achievements
- Successfully fine-tuned SpeechT5 for Turkish TTS
- Achieved significant reduction in loss metrics
- Maintained high quality while optimizing performance
Future Improvements
- Expand dataset with more diverse speakers
- Implement emotion and style transfer capabilities
- Further optimize inference speed
- Explore multi-speaker adaptation
- Investigate cross-lingual transfer learning
Recommendations
- Regular model retraining with expanded datasets
- Implementation of continuous evaluation pipeline
- Development of specialized preprocessing for Turkish language features
- Integration of automated quality assessment tools
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Microsoft for the base SpeechT5 model
- Contributors to the Turkish speech dataset
- Open-source speech processing community