speecht5_finetuned_essam2_ar

This model is a fine-tuned version of MBZUAI/speecht5_tts_clartts_ar on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.3333

Uses

🤗 Transformers Usage

You can run ArTST TTS locally with the 🤗 Transformers library.

  1. First install the 🤗 Transformers library, sentencepiece, soundfile and datasets (optional):
pip install --upgrade pip
pip install --upgrade transformers sentencepiece soundfile "datasets[audio]"
  2. Run inference via the Text-to-Speech (TTS) pipeline. You can access the Arabic SpeechT5 model via the TTS pipeline in just a few lines of code:
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

# Build the TTS pipeline from the fine-tuned checkpoint
synthesiser = pipeline("text-to-speech", "Messam174/speecht5_finetuned_essam2_ar")

# Load a speaker x-vector and add a batch dimension
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.

# ArTST is trained without diacritics, so pass undiacritized text.
speech = synthesiser("السلام عليكم ورحمة الله وبركاته حياكم الله جميعا", forward_params={"speaker_embeddings": speaker_embedding})

# Save the generated waveform as a WAV file
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
  3. Run inference via the Transformers modelling code. For more fine-grained control, you can use the processor and generate_speech to convert text into a mono 16 kHz speech waveform.
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load processor, model, and vocoder
processor = SpeechT5Processor.from_pretrained("Messam174/speecht5_finetuned_essam2_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("Messam174/speecht5_finetuned_essam2_ar").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# Prepare inputs (undiacritized Arabic text)
inputs = processor(
    text="السلام عليكم ورحمة الله وبركاته حياكم الله جميعا", return_tensors="pt"
).to(device)

# Load xvector containing speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0).to(device)

# Generate speech
with torch.no_grad():  # Disable gradient computation for inference
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the output as a 16 kHz WAV file
wav_file = "speech.wav"
sf.write(wav_file, speech.cpu().numpy(), samplerate=16000)
print(f"Speech saved to '{wav_file}'")
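Because ArTST is trained without diacritics, it may help to strip diacritic marks from the input text before synthesis. The following is a minimal sketch, not part of the original model card: the helper name strip_diacritics is hypothetical, and the choice of code points (U+064B to U+0652, the superscript alef U+0670, and the tatweel U+0640) is an assumption about what counts as a diacritic here.

```python
import re

# Assumption: treat the tashkeel marks U+064B..U+0652 (fathatan through sukun),
# the superscript alef (U+0670) and the tatweel/kashida (U+0640) as removable.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Return `text` with Arabic diacritic marks and tatweel removed."""
    return _DIACRITICS.sub("", text)


# "as-salaamu" written with shadda, fatha and damma marks, built from escapes
# so the combining characters are unambiguous.
diacritized = "\u0627\u0644\u0633\u0651\u064E\u0644\u064E\u0627\u0645\u064F"
plain = strip_diacritics(diacritized)
print(plain)
```

The cleaned string can then be passed to the pipeline or to the processor in place of the raw text.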


Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 32
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 100
  • training_steps: 500
  • mixed_precision_training: Native AMP
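To make the relationship between these values concrete, here is a small pure-Python sketch of how the total batch size and a linear schedule with warmup follow from the listed hyperparameters. It assumes a single training device (NUM_DEVICES is not stated in the card) and uses the usual linear-warmup, linear-decay rule; the function names are illustrative, not taken from the training script.

```python
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 8
WARMUP_STEPS = 100
TRAINING_STEPS = 500
NUM_DEVICES = 1  # assumption: the card does not state the device count


def total_train_batch_size() -> int:
    """Examples consumed per optimizer update."""
    return TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print("total_train_batch_size:", total_train_batch_size())
for step in (0, 50, 100, 300, 500):
    print(step, linear_lr(step))
```

Under these assumptions the product 4 x 8 x 1 matches the reported total_train_batch_size of 32, and the learning rate peaks at 0.0001 at step 100 before decaying to zero at step 500.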

Training results

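| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| 0.3806        | 0.3742 | 100  | 0.3452          |
| 0.3873        | 0.7484 | 200  | 0.3487          |
| 0.3788        | 1.1225 | 300  | 0.3441          |
| 0.3676        | 1.4967 | 400  | 0.3380          |
| 0.3668        | 1.8709 | 500  | 0.3333          |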
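The epoch and step columns allow a rough back-of-the-envelope estimate of the training set size, which the card otherwise leaves unspecified. The sketch below is only an estimate derived from the reported numbers: it assumes the total train batch size of 32 listed above, ignores rounding in the epoch column and any partial final batch, and is not a figure reported by the author.

```python
# (step, epoch, validation_loss) rows copied from the training results
rows = [
    (100, 0.3742, 0.3452),
    (200, 0.7484, 0.3487),
    (300, 1.1225, 0.3441),
    (400, 1.4967, 0.3380),
    (500, 1.8709, 0.3333),
]
TOTAL_TRAIN_BATCH_SIZE = 32  # from the hyperparameter list

# Optimizer steps per epoch, averaged over the logged rows
steps_per_epoch = sum(step / epoch for step, epoch, _ in rows) / len(rows)

# Rough number of training examples seen in one epoch
approx_examples = steps_per_epoch * TOTAL_TRAIN_BATCH_SIZE

# Step with the lowest validation loss
best_step, _, best_loss = min(rows, key=lambda row: row[2])

print(f"steps per epoch: about {steps_per_epoch:.1f}")
print(f"training examples: about {approx_examples:.0f}")
print(f"best validation loss {best_loss} at step {best_step}")
```

The lowest validation loss in the log is the final one, at step 500, which is the 0.3333 value reported at the top of the card.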

Framework versions

  • Transformers 4.46.3
  • Pytorch 2.5.1+cu121
  • Datasets 3.2.0
  • Tokenizers 0.20.3