maskgct / egs /datasets /README.md
Hecheng0625's picture
Upload 167 files
8c92a11 verified
# Datasets Format
Amphion support the following academic datasets (sort alphabetically):
- [Datasets Format](#datasets-format)
- [AudioCaps](#audiocaps)
- [CSD](#csd)
- [CustomSVCDataset](#customsvcdataset)
- [Hi-Fi TTS](#hifitts)
- [KiSing](#kising)
- [LibriLight](#librilight)
- [LibriTTS](#libritts)
- [LJSpeech](#ljspeech)
- [M4Singer](#m4singer)
- [NUS-48E](#nus-48e)
- [Opencpop](#opencpop)
- [OpenSinger](#opensinger)
- [Opera](#opera)
- [PopBuTFy](#popbutfy)
- [PopCS](#popcs)
- [PJS](#pjs)
- [SVCC](#svcc)
- [VCTK](#vctk)
The downloading link and the file structure tree of each dataset is displayed as follows.
> **Note:** When using Docker to run Amphion, mount the dataset to the container is necessary after downloading. Check [Mount dataset in Docker container](./docker.md) for more details.
## AudioCaps
AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information.
Download AudioCaps dataset [here](https://github.com/cdjkim/audiocaps). The file structure looks like below:
```plaintext
[AudioCaps dataset path]
┣ AudioCpas
┃ ┣ wav
┃ ┃ ┣ ---1_cCGK4M_0_10000.wav
┃ ┃ ┣ ---lTs1dxhU_30000_40000.wav
┃ ┃ ┣ ...
```
## CSD
Download the official CSD dataset [here](https://zenodo.org/records/4785016). The file structure looks like below:
```plaintext
[CSD dataset path]
┣ english
┣ korean
┣ utterances
┃ ┣ en001a
┃ ┃ ┣ {UtterenceID}.wav
┃ ┣ en001b
┃ ┣ en002a
┃ ┣ en002b
┃ ┣ ...
┣ README
```
## CustomSVCDataset
We support custom dataset for Singing Voice Conversion. Organize your data in the following structure to construct your own dataset:
```plaintext
[Your Custom Dataset Path]
┣ singer1
┃ ┣ song1
┃ ┃ ┣ utterance1.wav
┃ ┃ ┣ utterance2.wav
┃ ┃ ┣ ...
┃ ┣ song2
┃ ┣ ...
┣ singer2
┣ ...
```
## Hi-Fi TTS
Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure looks like below:
```plaintext
[Hi-Fi TTS dataset path]
┣ audio
┃ ┣ 11614_other {Speaker_ID}_{SNR_subset}
┃ ┃ ┣ 10547 {Book_ID}
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0001.flac
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0003.flac
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0004.flac
┃ ┃ ┃ ┣ ...
┃ ┃ ┣ ...
┃ ┣ ...
┣ 92_manifest_clean_dev.json
┣ 92_manifest_clean_test.json
┣ 92_manifest_clean_train.json
┣ ...
┣ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
┣ ...
┣ books_bandwidth.tsv
┣ LICENSE.txt
┣ readers_books_clean.txt
┣ readers_books_other.txt
┣ README.txt
```
## KiSing
Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like below:
```plaintext
[KiSing dataset path]
┣ clean
┃ ┣ 421
┃ ┣ 422
┃ ┣ ...
```
## LibriLight
Download the official LibriLight dataset [here](https://github.com/facebookresearch/libri-light). The file structure looks like below:
```plaintext
[LibriTTS dataset path]
┣ small (Subset)
┃ ┣ 100 {Speaker_ID}
┃ ┃ ┣ sea_fairies_0812_librivox_64kb_mp3 {Chapter_ID}
┃ ┃ ┃ ┣ 01_baum_sea_fairies_64kb.flac
┃ ┃ ┃ ┣ 02_baum_sea_fairies_64kb.flac
┃ ┃ ┃ ┣ 03_baum_sea_fairies_64kb.flac
┃ ┃ ┃ ┣ 22_baum_sea_fairies_64kb.flac
┃ ┃ ┃ ┣ 01_baum_sea_fairies_64kb.json
┃ ┃ ┃ ┣ 02_baum_sea_fairies_64kb.json
┃ ┃ ┃ ┣ 03_baum_sea_fairies_64kb.json
┃ ┃ ┃ ┣ 22_baum_sea_fairies_64kb.json
┃ ┃ ┃ ┣ ...
┃ ┃ ┣ ...
┃ ┣ ...
┣ medium (Subset)
┣ ...
```
## LibriTTS
Download the official LibriTTS dataset [here](https://www.openslr.org/60/). The file structure looks like below:
```plaintext
[LibriTTS dataset path]
┣ BOOKS.txt
┣ CHAPTERS.txt
┣ eval_sentences10.tsv
┣ LICENSE.txt
┣ NOTE.txt
┣ reader_book.tsv
┣ README_librispeech.txt
┣ README_libritts.txt
┣ speakers.tsv
┣ SPEAKERS.txt
┣ dev-clean (Subset)
┃ ┣ 1272{Speaker_ID}
┃ ┃ ┣ 128104 {Chapter_ID}
┃ ┃ ┃ ┣ 1272_128104_000001_000000.normalized.txt
┃ ┃ ┃ ┣ 1272_128104_000001_000000.original.txt
┃ ┃ ┃ ┣ 1272_128104_000001_000000.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ 1272_128104.book.tsv
┃ ┃ ┃ ┣ 1272_128104.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
┣ dev-other (Subset)
┃ ┣ 116 (Speaker)
┃ ┃ ┣ 288045 {Chapter_ID}
┃ ┃ ┃ ┣ 116_288045_000003_000000.normalized.txt
┃ ┃ ┃ ┣ 116_288045_000003_000000.original.txt
┃ ┃ ┃ ┣ 116_288045_000003_000000.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ 116_288045.book.tsv
┃ ┃ ┃ ┣ 116_288045.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
┃ ┣ ...
┣ test-clean (Subset)
┃ ┣ {Speaker_ID}
┃ ┃ ┣ {Chapter_ID}
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
┣ test-other
┃ ┣ {Speaker_ID}
┃ ┃ ┣ {Chapter_ID}
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
┣ train-clean-100
┃ ┣ {Speaker_ID}
┃ ┃ ┣ {Chapter_ID}
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
┣ train-clean-360
┃ ┣ {Speaker_ID}
┃ ┃ ┣ {Chapter_ID}
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
┣ train-other-500
┃ ┣ {Speaker_ID}
┃ ┃ ┣ {Chapter_ID}
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃ ┃ ┣ ...
┃ ┣ ...
```
## LJSpeech
Download the official LJSpeech dataset [here](https://keithito.com/LJ-Speech-Dataset/). The file structure looks like below:
```plaintext
[LJSpeech dataset path]
┣ metadata.csv
┣ wavs
┃ ┣ LJ001-0001.wav
┃ ┣ LJ001-0002.wav
┃ ┣ ...
┣ README
```
## M4Singer
Download the official M4Singer dataset [here](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view). The file structure looks like below:
```plaintext
[M4Singer dataset path]
┣ {Singer_1}#{Song_1}
┃ ┣ 0000.mid
┃ ┣ 0000.TextGrid
┃ ┣ 0000.wav
┃ ┣ ...
┣ {Singer_1}#{Song_2}
┣ ...
┣ {Singer_2}#{Song_1}
┣ {Singer_2}#{Song_2}
┣ ...
β”— meta.json
```
## NUS-48E
Download the official NUS-48E dataset [here](https://drive.google.com/drive/folders/12pP9uUl0HTVANU3IPLnumTJiRjPtVUMx). The file structure looks like below:
```plaintext
[NUS-48E dataset path]
┣ {SpeakerID}
┃ ┣ read
┃ ┃ ┣ {SongID}.txt
┃ ┃ ┣ {SongID}.wav
┃ ┃ ┣ ...
┃ ┣ sing
┃ ┃ ┣ {SongID}.txt
┃ ┃ ┣ {SongID}.wav
┃ ┃ ┣ ...
┣ ...
┣ README.txt
```
## Opencpop
Download the official Opencpop dataset [here](https://wenet.org.cn/opencpop/). The file structure looks like below:
```plaintext
[Opencpop dataset path]
┣ midis
┃ ┣ 2001.midi
┃ ┣ 2002.midi
┃ ┣ 2003.midi
┃ ┣ ...
┣ segments
┃ ┣ wavs
┃ ┃ ┣ 2001000001.wav
┃ ┃ ┣ 2001000002.wav
┃ ┃ ┣ 2001000003.wav
┃ ┃ ┣ ...
┃ ┣ test.txt
┃ ┣ train.txt
┃ β”— transcriptions.txt
┣ textgrids
┃ ┣ 2001.TextGrid
┃ ┣ 2002.TextGrid
┃ ┣ 2003.TextGrid
┃ ┣ ...
┣ wavs
┃ ┣ 2001.wav
┃ ┣ 2002.wav
┃ ┣ 2003.wav
┃ ┣ ...
┣ TERMS_OF_ACCESS
β”— readme.md
```
## OpenSinger
Download the official OpenSinger dataset [here](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view). The file structure looks like below:
```plaintext
[OpenSinger dataset path]
┣ ManRaw
┃ ┣ {Singer_1}_{Song_1}
┃ ┃ ┣ {Singer_1}_{Song_1}_0.lab
┃ ┃ ┣ {Singer_1}_{Song_1}_0.txt
┃ ┃ ┣ {Singer_1}_{Song_1}_0.wav
┃ ┃ ┣ ...
┃ ┣ {Singer_1}_{Song_2}
┃ ┣ ...
┣ WomanRaw
┣ LICENSE
β”— README.md
```
## Opera
Download the official Opera dataset [here](http://isophonics.net/SingingVoiceDataset). The file structure looks like below:
```plaintext
[Opera dataset path]
┣ monophonic
┃ ┣ chinese
┃ ┃ ┣ {Gender}_{SingerID}
┃ ┃ ┃ ┣ {Emotion}_{SongID}.wav
┃ ┃ ┃ ┣ ...
┃ ┃ ┣ ...
┃ ┣ western
┣ polyphonic
┃ ┣ chinese
┃ ┣ western
┣ CrossculturalDataSet.xlsx
```
## PopBuTFy
Download the official PopBuTFy dataset [here](https://github.com/MoonInTheRiver/NeuralSVB). The file structure looks like below:
```plaintext
[PopBuTFy dataset path]
┣ data
┃ ┣ {SingerID}#singing#{SongName}_Amateur
┃ ┃ ┣ {SingerID}#singing#{SongName}_Amateur_{UtteranceID}.mp3
┃ ┃ ┣ ...
┃ ┣ {SingerID}#singing#{SongName}_Professional
┃ ┃ ┣ {SingerID}#singing#{SongName}_Professional_{UtteranceID}.mp3
┃ ┃ ┣ ...
┣ text_labels
β”— TERMS_OF_ACCESS
```
## PopCS
Download the official PopCS dataset [here](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md). The file structure looks like below:
```plaintext
[PopCS dataset path]
┣ popcs
┃ ┣ popcs-{SongName}
┃ ┃ ┣ {UtteranceID}_ph.txt
┃ ┃ ┣ {UtteranceID}_wf0.wav
┃ ┃ ┣ {UtteranceID}.TextGrid
┃ ┃ ┣ {UtteranceID}.txt
┃ ┃ ┣ ...
┃ ┣ ...
β”— TERMS_OF_ACCESS
```
## PJS
Download the official PJS dataset [here](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus). The file structure looks like below:
```plaintext
[PJS dataset path]
┣ PJS_corpus_ver1.1
┃ ┣ background_noise
┃ ┣ pjs{SongID}
┃ ┃ ┣ pjs{SongID}_song.wav
┃ ┃ ┣ pjs{SongID}_speech.wav
┃ ┃ ┣ pjs{SongID}.lab
┃ ┃ ┣ pjs{SongID}.mid
┃ ┃ ┣ pjs{SongID}.musicxml
┃ ┃ ┣ pjs{SongID}.txt
┃ ┣ ...
```
## SVCC
Download the official SVCC dataset [here](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset). The file structure looks like below:
```plaintext
[SVCC dataset path]
┣ Data
┃ ┣ CDF1
┃ ┃ ┣ 10001.wav
┃ ┃ ┣ 10002.wav
┃ ┃ ┣ ...
┃ ┣ CDM1
┃ ┣ IDF1
┃ ┣ IDM1
β”— README.md
```
## VCTK
Download the official VCTK dataset [here](https://datashare.ed.ac.uk/handle/10283/3443). The file structure looks like below:
```plaintext
[VCTK dataset path]
┣ txt
┃ ┣ {Speaker_1}
┃ ┃ ┣ {Speaker_1}_001.txt
┃ ┃ ┣ {Speaker_1}_002.txt
┃ ┃ ┣ ...
┃ ┣ {Speaker_2}
┃ ┣ ...
┣ wav48_silence_trimmed
┃ ┣ {Speaker_1}
┃ ┃ ┣ {Speaker_1}_001_mic1.flac
┃ ┃ ┣ {Speaker_1}_001_mic2.flac
┃ ┃ ┣ {Speaker_1}_002_mic1.flac
┃ ┃ ┣ ...
┃ ┣ {Speaker_2}
┃ ┣ ...
┣ speaker-info.txt
β”— update.txt
```