|
--- |
|
license: mit |
|
language: |
|
- ar |
|
pipeline_tag: text-to-speech |
|
--- |
|
|
|
ArTST: SpeechT5 for Arabic (TTS task) |
|
|
|
This model starts from the pretrained ArTST weights and is fine-tuned, using the Hugging Face implementation of SpeechT5, on the Classical Arabic ClArTTS dataset for speech synthesis (text-to-speech). |
|
|
|
ArTST was first released in [this repository](https://github.com/mbzuai-nlp/ArTST); the [pretrained weights](https://huggingface.co/MBZUAI/ArTST/blob/main/pretrain_checkpoint.pt) are hosted on the Hugging Face Hub. |
|
|
|
# Uses |
|
## 🤗 Transformers Usage |
|
|
|
You can run ArTST TTS locally with the 🤗 Transformers library. |
|
|
|
1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile and, optionally, datasets: |
|
|
|
``` |
|
pip install --upgrade pip |
|
pip install --upgrade transformers sentencepiece soundfile datasets[audio] |
|
``` |
|
2. Run inference via the `Text-to-Speech` (TTS) pipeline. You can access the Arabic SpeechT5 model via the TTS pipeline in just a few lines of code! |
|
|
|
```python |
|
from transformers import pipeline |
|
from datasets import load_dataset |
|
import torch |
|
import soundfile as sf |
|
|
|
synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar") |
|
|
|
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation") |
|
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0) |
|
# You can replace this embedding with your own as well. |
|
|
|
speech = synthesiser("لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", forward_params={"speaker_embeddings": speaker_embedding}) |
|
# ArTST is trained without diacritics. |
|
|
|
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"]) |
|
``` |
|
3. Run inference via the Transformers modelling code. For more fine-grained control, you can use the processor and `generate_speech` to convert text into a mono 16 kHz speech waveform. |
|
|
|
```python |
|
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan |
|
from datasets import load_dataset |
|
import torch |
|
import soundfile as sf |
|
|
|
processor = SpeechT5Processor.from_pretrained("MBZUAI/speecht5_tts_clartts_ar") |
|
model = SpeechT5ForTextToSpeech.from_pretrained("MBZUAI/speecht5_tts_clartts_ar") |
|
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan") |
|
|
|
inputs = processor(text="لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", return_tensors="pt") |
|
|
|
# load xvector containing speaker's voice characteristics from a dataset |
|
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation") |
|
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0) |
|
|
|
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder) |
|
|
|
sf.write("speech.wav", speech.numpy(), samplerate=16000) |
|
``` |
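
The first example notes that ArTST is trained without diacritics, so it may help to strip them from vocalised input before passing it to the pipeline or the processor. The following is a minimal sketch of such a preprocessing step, not part of the model's own code: the helper name `strip_diacritics` is hypothetical, and the exact set of marks to remove (the harakat U+064B to U+0652, the superscript alef U+0670, and the tatweel U+0640) is an assumption that you may want to adjust for your data.

```python
import re

# Assumed set of marks to drop: the harakat U+064B..U+0652 (tanween, fatha,
# damma, kasra, shadda, sukun), the superscript alef U+0670, and the
# tatweel U+0640 (an elongation character rather than a diacritic).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the Arabic marks listed above removed."""
    return _DIACRITICS.sub("", text)


if __name__ == "__main__":
    vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba" written with fatha marks
    print(strip_diacritics(vocalised))  # the same three letters without the marks
```

The cleaned string can then be used as the `text` argument in either example above. Letters that carry a hamza, such as أ, are precomposed code points and are left untouched by this pattern.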
|
|
|
|
|
# Citation |
|
|
|
|
|
|
**BibTeX:** |
|
|
|
```bibtex |
|
@inproceedings{toyin-etal-2023-artst, |
|
title = "{A}r{TST}: {A}rabic Text and Speech Transformer", |
|
author = "Toyin, Hawau and |
|
Djanibekov, Amirbek and |
|
Kulkarni, Ajinkya and |
|
Aldarmaki, Hanan", |
|
editor = "Sawaf, Hassan and |
|
El-Beltagy, Samhaa and |
|
Zaghouani, Wajdi and |
|
Magdy, Walid and |
|
Abdelali, Ahmed and |
|
Tomeh, Nadi and |
|
Abu Farha, Ibrahim and |
|
Habash, Nizar and |
|
Khalifa, Salam and |
|
Keleg, Amr and |
|
Haddad, Hatem and |
|
Zitouni, Imed and |
|
Mrini, Khalil and |
|
Almatham, Rawan", |
|
booktitle = "Proceedings of ArabicNLP 2023", |
|
month = dec, |
|
year = "2023", |
|
address = "Singapore (Hybrid)", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2023.arabicnlp-1.5", |
|
pages = "41--51" |
|
} |
|
@inproceedings{ao-etal-2022-speecht5, |
|
title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing}, |
|
author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu}, |
|
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, |
|
month = {May}, |
|
year = {2022}, |
|
pages={5723--5738}, |
|
} |
|
``` |
|
|