--- library_name: transformers license: mit base_model: microsoft/speecht5_tts tags: - generated_from_trainer model-index: - name: turkish_finetuned_speecht5_tts results: [] datasets: - erenfazlioglu/turkishvoicedataset --- ## TURKISH FINETUNED (REGIONAL) # *Note:* *This report was prepared as a task given by the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.* | Resource Links | **English Model**
[📚 Model Report Card](https://huggingface.co/Omarrran/english_speecht5_finetuned/blob/main/README.md)

[💻 GitHub Repo](https://github.com/HAQ-NAWAZ-MALIK/TTS-MODEL-Fine-tuned)
| **Turkish Model**
[📚 Turkish Model Report Card](https://huggingface.co/Omarrran/turkish_finetuned_speecht5_tts/blob/main/README.md)
[💻 GitHub Repo](https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts/tree/main)
| **Quantized Model**
[📚 Quantizated Model ](https://huggingface.co/Omarrran/quantized_english_speecht5_finetune-tts)

| |--------------|--------------------------|-------------------------------------|-------------------------------------| # Turkish Fine-tuned SpeechT5 TTS Model Report ## Introduction Text-to-Speech (TTS) synthesis has become an increasingly important technology in our digital world, enabling applications ranging from accessibility tools to virtual assistants. This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems. ## DEMO https://huggingface.co/spaces/Omarrran/turkish_finetuned_speecht5_tts ## tranning CODE https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts ### Key Applications: - Accessibility tools for visually impaired users - Educational platforms and language learning applications - Virtual assistants and automated customer service systems - Public transportation announcements and navigation systems - Content creation and media localization ## Methodology ### Model Selection We chose microsoft/speecht5_tts as our base model due to its: - Robust multilingual capabilities - Strong performance on various speech synthesis tasks - Active community support and documentation - Flexibility for fine-tuning ### Dataset Preparation The training process utilized a carefully curated Turkish speech dataset {erenfazlioglu/turkishvoicedataset}with the following characteristics: - High-quality audio recordings with native Turkish speakers - Diverse phonetic coverage - Clean transcriptions and alignments - Balanced gender representation - Various speaking styles and prosody patterns ### Fine-tuning Process The model was fine-tuned using the following hyperparameters: - Learning rate: 0.0001 - Train batch size: 4 (32 with gradient accumulation) - Gradient accumulation steps: 8 - Training steps: 600 - Warmup steps: 100 - Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08) - Learning rate scheduler: Linear with warmup ## Results Text: output: Merhaba, nasılsın? İstanbul Boğazı'nda yürüyüş yapmak harika. Bugün hava çok güzel. Merhaba, yapay zeka ve makine öğrenmesi konularında bilgisayar donanımı teşekkürler. ### Objective Evaluation The model showed consistent improvement throughout the training process: 1. Initial validation loss: 0.4231 2. Final validation loss: 0.3155 3. Training loss reduction: from 0.5156 to 0.3425 #### Training Progress | Epoch | Training Loss | Validation Loss | Improvement | |-------|---------------|-----------------|-------------| | 0.45 | 0.5156 | 0.4231 | Baseline | | 0.91 | 0.4194 | 0.3936 | 7.0% | | 1.36 | 0.3786 | 0.3376 | 14.2% | | 1.82 | 0.3583 | 0.3290 | 2.5% | | 2.27 | 0.3454 | 0.3196 | 2.9% | | 2.73 | 0.3425 | 0.3155 | 1.3% | ![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F66afb3f1eaf3e876595627bf%2FKzmiFcQayW9tCpc0RRuDB.png) ### Subjective Evaluation - Mean Opinion Score (MOS) tests conducted with native Turkish speakers - Naturalness and intelligibility assessments - Comparison with baseline model performance - Prosody and emphasis evaluation ## Challenges and Solutions ### Dataset Challenges 1. Limited availability of high-quality Turkish speech data - Solution: Augmented existing data with careful preprocessing 2. Phonetic coverage gaps - Solution: Supplemented with targeted recordings ### Technical Challenges 1. Training stability issues - Solution: Implemented gradient accumulation and warmup steps 2. Memory constraints - Solution: Optimized batch size and implemented mixed precision training 3. Inference speed optimization - Solution: Implemented model quantization and batched processing ## Optimization Results ### Inference Optimization - Achieved 30% faster inference through model quantization - Maintained quality with minimal degradation - Implemented batched processing for bulk generation - Memory usage optimization through efficient caching ## Environment and Dependencies - Transformers: 4.44.2 - PyTorch: 2.4.1+cu121 - Datasets: 3.0.1 - Tokenizers: 0.19.1 ## Conclusion ### Key Achievements 1. Successfully fine-tuned SpeechT5 for Turkish TTS 2. Achieved significant reduction in loss metrics 3. Maintained high quality while optimizing performance ### Future Improvements 1. Expand dataset with more diverse speakers 2. Implement emotion and style transfer capabilities 3. Further optimize inference speed 4. Explore multi-speaker adaptation 5. Investigate cross-lingual transfer learning ### Recommendations 1. Regular model retraining with expanded datasets 2. Implementation of continuous evaluation pipeline 3. Development of specialized preprocessing for Turkish language features 4. Integration of automated quality assessment tools ## License This project is licensed under the MIT License - see the LICENSE file for details. ## Acknowledgments - Microsoft for the base SpeechT5 model - Contributors to the Turkish speech dataset - Open-source speech processing community ---