Duplicate from capleaf/viXTTS
Co-authored-by: Thinh Le <[email protected]>
- .gitattributes +35 -0
- LICENSE.txt +84 -0
- README.md +49 -0
- config.json +176 -0
- model.pth +3 -0
- samples/nam-calm.wav +0 -0
- samples/nam-cham.wav +0 -0
- samples/nam-nhanh.wav +0 -0
- samples/nam-truyen-cam.wav +0 -0
- samples/nu-calm.wav +0 -0
- samples/nu-cham.wav +0 -0
- samples/nu-luu-loat.wav +0 -0
- samples/nu-nhan-nha.wav +0 -0
- samples/nu-nhe-nhang.wav +0 -0
- vi_sample.wav +0 -0
- vocab.json +0 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
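Reviewer note: the `.gitattributes` patterns decide which paths Git LFS stores as pointers. The following is a minimal Python sketch, not Git's own matcher. It assumes that matching the file's basename with `fnmatch` is a good enough approximation for slash-free globs. The pattern subset `LFS_PATTERNS` and the helper name `is_lfs_tracked` are illustrative and do not come from the commit.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Subset of the patterns added in .gitattributes (illustrative, not the full list).
LFS_PATTERNS = ["*.bin", "*.ckpt", "*.pt", "*.pth", "*.safetensors", "*.zip", "*tfevents*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: match the file's basename against slash-free patterns.

    Assumption: a pattern without a slash matches the basename at any depth.
    Real gitattributes matching has more rules than this sketch reproduces.
    """
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


for candidate in ["model.pth", "config.json", "samples/nam-calm.wav"]:
    print(candidate, is_lfs_tracked(candidate))
```

Under this approximation `model.pth` matches `*.pth`, while `config.json` and the `.wav` samples match none of the listed patterns; `*.wav` does not appear among the 35 entries.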
LICENSE.txt
ADDED
@@ -0,0 +1,84 @@
+# Coqui Public Model License 1.0.0
+https://coqui.ai/cpml.txt
+
+
+This license allows only non-commercial use of a machine learning model and its outputs.
+
+
+## Acceptance
+
+
+In order to get any license under these terms, you must agree to them as both strict obligations and conditions to all your licenses.
+
+
+## Licenses
+
+
+The licensor grants you a copyright license to do everything you might do with the model that would otherwise infringe the licensor's copyright in it, for any non-commercial purpose. The licensor grants you a patent license that covers patent claims the licensor can license, or becomes able to license, that you would infringe by using the model in the form provided by
+the licensor, for any non-commercial purpose.
+
+
+## Non-commercial Purpose
+
+
+Non-commercial purposes include any of the following uses of the model or its output, but only so far as you do not receive any direct or indirect payment arising from the use of the model or its output.
+
+
+### Personal use for research, experiment, and testing for the benefit of public knowledge, personal study, private entertainment, hobby projects, amateur pursuits, or religious
+observance.
+
+
+### Use by commercial or for-profit entities for testing, evaluation, or non-commercial research and development. Use of the model to train other models for commercial use is not a non-commercial purpose.
+
+
+### Use by any charitable organization for charitable purposes, or for testing or evaluation. Use for revenue-generating activity, including projects directly funded by government grants, is not a non-commercial purpose.
+
+
+## Notices
+
+
+You must ensure that anyone who gets a copy of any part of the model, or any modification of the model, or their output, from you also gets a copy of these terms or the URL for them above.
+
+
+## No Other Rights
+
+
+These terms do not allow you to sublicense or transfer any of your licenses to anyone else, or prevent the licensor from granting licenses to anyone else. These terms do not imply
+any other licenses.
+
+
+## Patent Defense
+
+
+If you make any written claim that the model infringes or contributes to infringement of any patent, your licenses for the model granted under these terms ends immediately. If your company makes such a claim, your patent license ends immediately for work on behalf of your company.
+
+
+## Violations
+
+
+The first time you are notified in writing that you have violated any of these terms, or done anything with the model or its output that is not covered by your licenses, your licenses can nonetheless continue if you come into full compliance with these terms, and take practical steps to correct past violations, within 30 days of receiving notice. Otherwise, all your licenses
+end immediately.
+
+
+## No Liability
+
+
+***As far as the law allows, the model and its output come as is, without any warranty or condition, and the licensor will not be liable to you for any damages arising out of these terms or the use or nature of the model or its output, under any kind of legal claim. If this provision is not enforceable in your jurisdiction, your licenses are void.***
+
+
+## Definitions
+
+
+The **licensor** is the individual or entity offering these terms, and the **model** is the model the licensor makes available under these terms, including any documentation or similar information about the model.
+
+
+**You** refers to the individual or entity agreeing to these terms.
+
+
+**Your company** is any legal entity, sole proprietorship, or other kind of organization that you work for, plus all organizations that have control over, are under the control of, or are under common control with that organization. **Control** means ownership of substantially all the assets of an entity, or the power to direct its management and policies by vote, contract, or otherwise. Control can be direct or indirect.
+
+
+**Your licenses** are all the licenses granted to you under these terms.
+
+
+**Use** means anything you do with the model or its output requiring one of your licenses.
README.md
ADDED
@@ -0,0 +1,49 @@
+---
+license: other
+license_name: coqui-public-model-license
+license_link: https://coqui.ai/cpml
+pipeline_tag: text-to-speech
+datasets:
+- capleaf/viVoice
+language:
+- vi
+---
+
+# viⓍTTS
+
+viⓍTTS là mô hình tạo sinh giọng nói cho phép bạn sao chép giọng nói sang các ngôn ngữ khác nhau chỉ bằng cách sử dụng một đoạn âm thanh nhanh dài 6 giây. Mô hình này được tiếp tục đào tạo từ mô hình [XTTS-v2.0.3](https://huggingface.co/coqui/XTTS-v2) bằng cách mở rộng tokenizer sang tiếng Việt và huấn luyện trên tập dữ liệu [viVoice](https://huggingface.co/datasets/thinhlpg/viVoice).
+
+viⓍTTS is a voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. This model is fine-tuned from the [XTTS-v2.0.3](https://huggingface.co/coqui/XTTS-v2) model by expanding the tokenizer to Vietnamese and fine-tuning on the [viVoice](https://huggingface.co/datasets/thinhlpg/viVoice) dataset.
+
+### Languages
+
+viXTTS supports 18 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt),
+Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko),
+Hindi (hi), **Vietnamese (vi)**.
+
+### Known Limitations
+
+- Incompatibility with the [original TTS library](https://github.com/coqui-ai/TTS) (a pull request will be made later).
+- Subpar performance for Vietnamese input sentences under 10 words (inconsistent output and odd trailing sounds).
+- This model was fine-tuned only on Vietnamese. Its quality in the other languages has not been tested and may be reduced.
+
+### Demo
+
+Please check out [this repo](https://github.com/thinhlpg/vixtts-demo).
+
+### Usage
+
+For quick usage, please check out [this notebook](https://colab.research.google.com/drive/1q9vA7mDyvK_u0ijDDNuycDoUUbryM3p3?usp=sharing).
+
+### License
+
+This model is licensed under the [Coqui Public Model License](https://coqui.ai/cpml).
+
+### Contact
+
+Fine-tuned by Thinh Le at FPT University HCMC, as a component of [Non La](https://huggingface.co/capleaf)'s graduation thesis.
+Contact:
+
+- You can message me directly on Facebook: <https://fb.com/thinhlpg/> (preferred 🤗)
+- GitHub: <https://github.com/thinhlpg>
+- Email: <[email protected]> or <[email protected]>
config.json
ADDED
@@ -0,0 +1,176 @@
+{
+    "output_path": "output",
+    "logger_uri": null,
+    "run_name": "run",
+    "project_name": null,
+    "run_description": "viXTTS training",
+    "print_step": null,
+    "plot_step": null,
+    "model_param_stats": false,
+    "wandb_entity": null,
+    "dashboard_logger": "tensorboard",
+    "save_on_interrupt": true,
+    "log_model_step": null,
+    "save_step": 24000,
+    "save_n_checkpoints": 2,
+    "save_checkpoints": true,
+    "save_all_best": false,
+    "save_best_after": 0,
+    "target_loss": null,
+    "print_eval": true,
+    "test_delay_epochs": 0,
+    "run_eval": true,
+    "run_eval_steps": null,
+    "distributed_backend": "nccl",
+    "distributed_url": "tcp://localhost:54321",
+    "mixed_precision": false,
+    "precision": "fp16",
+    "epochs": 5,
+    "batch_size": 2,
+    "eval_batch_size": 2,
+    "grad_clip": 0.0,
+    "scheduler_after_epoch": true,
+    "lr": 5e-06,
+    "optimizer": "AdamW",
+    "optimizer_params": {
+        "betas": [
+            0.9,
+            0.96
+        ],
+        "eps": 1e-08,
+        "weight_decay": 0.01
+    },
+    "lr_scheduler": "MultiStepLR",
+    "lr_scheduler_params": {
+        "milestones": [
+            900000,
+            2700000,
+            5400000
+        ],
+        "gamma": 0.5,
+        "last_epoch": -1
+    },
+    "use_grad_scaler": false,
+    "allow_tf32": false,
+    "cudnn_enable": true,
+    "cudnn_deterministic": false,
+    "cudnn_benchmark": false,
+    "training_seed": 1,
+    "model": "xtts",
+    "num_loader_workers": 0,
+    "num_eval_loader_workers": 0,
+    "use_noise_augment": false,
+    "audio": {
+        "sample_rate": 22050,
+        "output_sample_rate": 24000,
+        "dvae_sample_rate": 22050
+    },
+    "use_phonemes": false,
+    "phonemizer": null,
+    "phoneme_language": null,
+    "compute_input_seq_cache": false,
+    "text_cleaner": null,
+    "enable_eos_bos_chars": false,
+    "test_sentences_file": "",
+    "phoneme_cache_path": null,
+    "characters": null,
+    "add_blank": false,
+    "batch_group_size": 48,
+    "loss_masking": null,
+    "min_audio_len": 1,
+    "max_audio_len": Infinity,
+    "min_text_len": 1,
+    "max_text_len": Infinity,
+    "compute_f0": false,
+    "compute_energy": false,
+    "compute_linear_spec": false,
+    "precompute_num_workers": 0,
+    "start_by_longest": false,
+    "shuffle": false,
+    "drop_last": false,
+    "datasets": [
+        {
+            "formatter": "",
+            "dataset_name": "",
+            "path": "",
+            "meta_file_train": "",
+            "ignored_speakers": null,
+            "language": "",
+            "phonemizer": "",
+            "meta_file_val": "",
+            "meta_file_attn_mask": ""
+        }
+    ],
+    "test_sentences": [],
+    "eval_split_max_size": null,
+    "eval_split_size": 0.01,
+    "use_speaker_weighted_sampler": false,
+    "speaker_weighted_sampler_alpha": 1.0,
+    "use_language_weighted_sampler": false,
+    "language_weighted_sampler_alpha": 1.0,
+    "use_length_weighted_sampler": false,
+    "length_weighted_sampler_alpha": 1.0,
+    "model_args": {
+        "gpt_batch_size": 1,
+        "enable_redaction": false,
+        "kv_cache": true,
+        "gpt_checkpoint": null,
+        "clvp_checkpoint": null,
+        "decoder_checkpoint": null,
+        "num_chars": 255,
+        "tokenizer_file": "",
+        "gpt_max_audio_tokens": 605,
+        "gpt_max_text_tokens": 402,
+        "gpt_max_prompt_tokens": 70,
+        "gpt_layers": 30,
+        "gpt_n_model_channels": 1024,
+        "gpt_n_heads": 16,
+        "gpt_number_text_tokens": 7544,
+        "gpt_start_text_token": null,
+        "gpt_stop_text_token": null,
+        "gpt_num_audio_tokens": 1026,
+        "gpt_start_audio_token": 1024,
+        "gpt_stop_audio_token": 1025,
+        "gpt_code_stride_len": 1024,
+        "gpt_use_masking_gt_prompt_approach": true,
+        "gpt_use_perceiver_resampler": true,
+        "input_sample_rate": 22050,
+        "output_sample_rate": 24000,
+        "output_hop_length": 256,
+        "decoder_input_dim": 1024,
+        "d_vector_dim": 512,
+        "cond_d_vector_in_each_upsampling_layer": true,
+        "duration_const": 102400
+    },
+    "model_dir": null,
+    "languages": [
+        "en",
+        "es",
+        "fr",
+        "de",
+        "it",
+        "pt",
+        "pl",
+        "tr",
+        "ru",
+        "nl",
+        "cs",
+        "ar",
+        "zh-cn",
+        "hu",
+        "ko",
+        "ja",
+        "hi",
+        "vi"
+    ],
+    "temperature": 0.85,
+    "length_penalty": 1.0,
+    "repetition_penalty": 2.0,
+    "top_k": 50,
+    "top_p": 0.85,
+    "num_gpt_outputs": 1,
+    "gpt_cond_len": 12,
+    "gpt_cond_chunk_len": 4,
+    "max_ref_len": 10,
+    "sound_norm_refs": false
+}
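Reviewer note: `config.json` writes `max_audio_len` and `max_text_len` as the bare literal `Infinity`, which strict JSON parsers reject but Python's `json` module accepts by default. The sketch below reads a small excerpt of the file with values copied from the diff. The helper names `sampling_params` and `supports_language` are hypothetical and are not part of any library.

```python
import json
import math

# Excerpt of config.json from this commit (values copied verbatim; most keys omitted).
CONFIG_EXCERPT = """
{
    "max_audio_len": Infinity,
    "max_text_len": Infinity,
    "audio": {"sample_rate": 22050, "output_sample_rate": 24000},
    "languages": ["en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru", "nl",
                  "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi", "vi"],
    "temperature": 0.85,
    "repetition_penalty": 2.0,
    "top_k": 50,
    "top_p": 0.85
}
"""

# Python's json module parses the non-standard literal Infinity by default.
config = json.loads(CONFIG_EXCERPT)


def sampling_params(cfg: dict) -> dict:
    """Collect the decoding settings into one dict (hypothetical helper)."""
    keys = ("temperature", "repetition_penalty", "top_k", "top_p")
    return {key: cfg[key] for key in keys}


def supports_language(cfg: dict, code: str) -> bool:
    """Return True if the language code is listed in the config."""
    return code in cfg["languages"]


print(math.isinf(config["max_audio_len"]))
print(len(config["languages"]), supports_language(config, "vi"))
print(sampling_params(config))
```

Tools that consume this file with a strict JSON parser would probably need to special-case the `Infinity` values or replace them before parsing.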
model.pth
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:534670e4b752002b7d7224e6ea1f467bd608c8dd3c36efaa45e1f4696e8bd1d2
+size 1875343894
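Reviewer note: the diff for `model.pth` is a Git LFS pointer, not the checkpoint itself. It records the SHA-256 digest and byte size of the real file. The following minimal sketch shows how a downloaded file could be checked against such a pointer. The function names `parse_lfs_pointer` and `matches_pointer` are illustrative, and the demonstration uses a small temporary file because the checkpoint (roughly 1.9 GB) is not assumed to be present.

```python
import hashlib
import os
import tempfile

# Pointer text as shown in the model.pth diff.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:534670e4b752002b7d7224e6ea1f467bd608c8dd3c36efaa45e1f4696e8bd1d2
size 1875343894
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {"algorithm": algorithm, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Compare a local file's size and digest with the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False  # cheap size check first, before hashing
    hasher = hashlib.new(pointer["algorithm"])
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])

# Demonstration on a stand-in file; it fails the size check, as expected.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"not the real checkpoint")
print(matches_pointer(tmp.name, pointer))
os.remove(tmp.name)
```

To check a real download, pass the local path of `model.pth` to `matches_pointer` in place of the temporary file.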
samples/nam-calm.wav
ADDED
Binary file (744 kB).
samples/nam-cham.wav
ADDED
Binary file (784 kB).
samples/nam-nhanh.wav
ADDED
Binary file (646 kB).
samples/nam-truyen-cam.wav
ADDED
Binary file (876 kB).
samples/nu-calm.wav
ADDED
Binary file (759 kB).
samples/nu-cham.wav
ADDED
Binary file (933 kB).
samples/nu-luu-loat.wav
ADDED
Binary file (711 kB).
samples/nu-nhan-nha.wav
ADDED
Binary file (764 kB).
samples/nu-nhe-nhang.wav
ADDED
Binary file (793 kB).
vi_sample.wav
ADDED
Binary file (793 kB).
vocab.json
ADDED
The diff for this file is too large to render.