Update README.md
README.md
CHANGED

Removed in this update:

-### Demo Spaces
-Coming soon...
-
-- [DagsHub](https://dagshub.com) who sponsored us with their GPU compute (with special thanks to Dean!)
-- And the assistance from [camenduru](https://github.com/camenduru) on cloud infrastructure and model training
-
-Stay tuned for Vokan V2!

The updated file in full:

---
license: mit
datasets:
- ShoukanLabs/AniSpeech
- vctk
- blabble-io/libritts_r
language:
- en
pipeline_tag: text-to-speech
---
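
The metadata above lists the corpora this model was finetuned on, all of which are hosted on the Hugging Face Hub. As a rough, hedged sketch (the config and split names below are assumptions; check each dataset card before relying on them), two of the corpora can be pulled with the `datasets` library:

```python
# Rough sketch: loading two of the corpora named in the model-card metadata
# with the Hugging Face `datasets` library. Config/split names and column
# layout are assumptions taken from typical dataset cards; verify before use.
from datasets import load_dataset

anispeech = load_dataset("ShoukanLabs/AniSpeech", split="train")
libritts_r = load_dataset("blabble-io/libritts_r", "clean", split="train.clean.100")

print(len(anispeech), "AniSpeech clips;", len(libritts_r), "LibriTTS-R clips")
```
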

<style>

/* … earlier rules unchanged and elided in this diff view … */

  display: block;
}

audio {
  margin: 0.5rem;
}

.audio-container {
  display: flex;
  justify-content: center;
  align-items: center;
}

</style>

<hr>

<!-- … unchanged section elided in this diff view … -->

<hr>

<a href="https://discord.gg/5bq9HqVhsJ"><img src="https://img.shields.io/badge/find_us_at_the-ShoukanLabs_Discord-invite?style=flat-square&logo=discord&logoColor=%23ffffff&labelColor=%235865F2&color=%23ffffff" width="320" alt="discord"></a>
<!--<a align="left" style="font-size: 1.3rem; font-weight: bold; color: #5662f6;" href="https://discord.gg/5bq9HqVhsJ">find us on Discord</a>-->

**Vokan** is an advanced finetuned **StyleTTS2** model crafted for authentic and expressive zero-shot performance, designed to serve as a better base model for further finetuning.
It leverages a diverse dataset and extensive training to generate high-quality synthesized speech.
Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan delivers authenticity and naturalness across a wide range of accents and contexts.
With more than six days of audio from 672 diverse and expressive speakers, Vokan captures a broad spectrum of vocal characteristics, which contributes to its strong performance.
Although it was trained on less data than the original StyleTTS2 base model, the breadth of accents and speakers enriches the model's vector space.
Training required significant computational resources: roughly 300 hours on a single H100, plus an additional 600 hours on a single RTX 3090.

You can read more in our article on [DagsHub](https://dagshub.com/blog/styletts2/)!
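
For a sense of how zero-shot synthesis works with a StyleTTS2-family checkpoint: a short reference clip is encoded into a style vector, which then conditions generation for arbitrary text. The sketch below is illustrative only; `load_model`, `compute_style`, and `inference` are hypothetical stand-ins for the upstream StyleTTS2 demo utilities ([yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)), and the checkpoint filename is a placeholder, so browse this repo's files for the real ones.

```python
# Illustrative sketch of zero-shot inference with a StyleTTS2 finetune.
# `load_model`, `compute_style`, and `inference` are hypothetical helpers
# mirroring the upstream StyleTTS2 demo notebooks, and the checkpoint
# filename is a placeholder; neither is a published API of this repo.
import soundfile as sf
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(repo_id="ShoukanLabs/Vokan", filename="Vokan.pth")  # placeholder filename

model = load_model(ckpt)                               # build the nets and load the finetuned weights
style = compute_style(model, "reference_speaker.wav")  # style vector from a few seconds of reference audio
wav = inference(model, "Hello from Vokan!", style)     # synthesized waveform (StyleTTS2 outputs 24 kHz)

sf.write("vokan_demo.wav", wav, 24000)
```
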

<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Vokan Samples!</p>
<div class='audio-container'>
  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%201.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>

  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%202.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>
</div>
<div class='audio-container'>
  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%203.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>
  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%204.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>
</div>
<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Acknowledgements</p>

- **[DagsHub](https://dagshub.com):** Special thanks to DagsHub for sponsoring GPU compute and for their excellent versioning service, enabling efficient model training and development. A shoutout to Dean in particular!
- **[camenduru](https://github.com/camenduru):** Thanks to camenduru for their expertise in cloud infrastructure and model training, which played a crucial role in Vokan's development. Please give them a follow!

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Conclusion</p>

V2 is currently in the works, aiming to be bigger and better in every way, including multilingual support!
This is where you come in: if you have any large single-speaker datasets you'd like to contribute, in any language,
you can add them to our **Vokan dataset**, a large **community dataset** that combines many smaller
single-speaker datasets into one big multispeaker corpus.
You can upload your Uberduck- or [FakeYou](https://fakeyou.com/)-compliant datasets via the
**[Vokan](https://huggingface.co/ShoukanLabs/Vokan)** bot on the **[ShoukanLabs Discord Server](https://discord.gg/hdVeretude)**.
The more data we have, the better the models we produce will be!
<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Citations</p>

```citations
@misc{li2023styletts,

… (remaining citation entries unchanged and elided in this diff view) …

The Centre for Speech Technology Research (CSTR),
University of Edinburgh
```

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">License</p>

```
MIT
```