yizhilll committed on
Commit
3bb10de
·
1 Parent(s): bccff53

Create README.md

Files changed (1)
  1. README.md +105 -0
README.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ license: mit
+ inference: false
+ tags:
+ - music
+ ---
+
+ # Introduction to our series of work
+
+ The development log of our Music Audio Pre-training (m-a-p) model family:
+ - 17/03/2023: we released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
+ - 14/03/2023: we retrained the MERT-v0 model on open-source-only music data and released it as [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
+ - 29/12/2022: we released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the **MLM** paradigm, which performs better on downstream tasks.
+ - 29/10/2022: we released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the **BYOL** paradigm.
+
+
+
+ Here is a table for quickly picking a model:
+
+ | Name | Pre-train Paradigm | Training Data (hours) | Pre-train Context (seconds) | Model Size | Transformer Layer-Dimension | Feature Rate | Sample Rate | Release Date |
+ | ------------------------------------------------------------ | ------------------ | --------------------- | --------------------------- | ---------- | --------------------------- | ------------ | ----------- | ------------ |
+ | [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) | MLM | 160K | 5 | 330M | 24-1024 | 75 Hz | 24K Hz | 17/03/2023 |
+ | [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) | MLM | 20K | 5 | 95M | 12-768 | 75 Hz | 24K Hz | 17/03/2023 |
+ | [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public) | MLM | 900 | 5 | 95M | 12-768 | 50 Hz | 16K Hz | 14/03/2023 |
+ | [MERT-v0](https://huggingface.co/m-a-p/MERT-v0) | MLM | 1000 | 5 | 95M | 12-768 | 50 Hz | 16K Hz | 29/12/2022 |
+ | [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1) | BYOL | 1000 | 30 | 95M | 12-768 | 50 Hz | 16K Hz | 30/10/2022 |
+
+ ## Explanation
+
+ The m-a-p models share a similar architecture; the most distinguishing difference is the paradigm used in pre-training. Beyond that, there are several technical configuration details to know before using them:
+
+ - **Model Size**: the number of parameters that will be loaded into memory. Please select a size that fits your hardware.
+ - **Transformer Layer-Dimension**: the number of transformer layers and the corresponding feature dimension of the model's outputs. This is marked out because features extracted from **different layers can perform differently depending on the task**.
+ - **Feature Rate**: the number of feature frames the model outputs for a 1-second audio input.
+ - **Sample Rate**: the sampling frequency of the audio the model was trained with.
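
The columns above can be turned into a rough sanity check for output shapes. The sketch below is illustrative only: `MODEL_SPECS` and `expected_feature_shape` are hypothetical names, not part of the m-a-p code. The numbers are copied from the table, and the extra `+ 1` assumes the hidden-state stack contains the embedding output in addition to one entry per transformer layer, as suggested by the 13-layer output of the 12-layer model in the usage example. The frame count is an estimate; the real number can differ slightly because of boundary effects in the convolutional front end.

```python
# Values copied from the model table; names here are hypothetical helpers.
MODEL_SPECS = {
    "MERT-v1-330M":   {"layers": 24, "dim": 1024, "feature_rate_hz": 75, "sample_rate_hz": 24000},
    "MERT-v1-95M":    {"layers": 12, "dim": 768,  "feature_rate_hz": 75, "sample_rate_hz": 24000},
    "MERT-v0-public": {"layers": 12, "dim": 768,  "feature_rate_hz": 50, "sample_rate_hz": 16000},
    "MERT-v0":        {"layers": 12, "dim": 768,  "feature_rate_hz": 50, "sample_rate_hz": 16000},
    "music2vec-v1":   {"layers": 12, "dim": 768,  "feature_rate_hz": 50, "sample_rate_hz": 16000},
}


def expected_feature_shape(model_name, duration_seconds):
    """Estimate (num_hidden_states, num_frames, feature_dim) for one audio clip.

    Assumes the clip is already at the model's sample rate and that the
    hidden-state stack includes the embedding output (hence layers + 1).
    """
    spec = MODEL_SPECS[model_name]
    num_frames = int(duration_seconds * spec["feature_rate_hz"])
    return (spec["layers"] + 1, num_frames, spec["dim"])


# A 5-second clip (the pre-training context length) for MERT-v1-95M.
print(expected_feature_shape("MERT-v1-95M", 5.0))  # (13, 375, 768)
```

Comparing this estimate with the actual `hidden_states` shape is a quick way to notice audio that was loaded at the wrong sample rate.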
+
+
+
+ # Introduction to MERT-v1
+
+ Compared to MERT-v0, we introduce several changes in MERT-v1 pre-training:
+
+ - Change the pseudo labels to 8 codebooks from [encodec](https://github.com/facebookresearch/encodec), which potentially have higher quality and empower our model to support music generation.
+ - MLM prediction with in-batch noise mixture.
+ - Train with a higher audio sample rate (24K Hz).
+ - Train with more audio data (up to 160 thousand hours).
+ - More available model sizes: 95M and 330M.
+
+
+
+ More details will be given in our forthcoming paper.
+
+
+
+ # Model Usage
+
+ ```python
+ from transformers import AutoModel, Wav2Vec2Processor
+ import torch
+ from torch import nn
+ from datasets import load_dataset
+
+ # load demo audio and set up the processor
+ # NOTE: this demo clip is 16 kHz speech, used only to show the API;
+ # the table above lists 24K Hz as the training sample rate of MERT-v1
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+ dataset = dataset.sort("id")
+ sampling_rate = dataset.features["audio"].sampling_rate
+ processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
+
+ # load our model weights
+ # pinning a revision is recommended for security reasons when using trust_remote_code;
+ # the hash may be updated in later releases
+ commit_hash = 'bccff5376fc07235d88954b43e5cd739fbc0796b'
+ model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True, revision=commit_hash)
+
+ # the audio file is decoded on the fly
+ inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs, output_hidden_states=True)
+
+ # take a look at the output shape: there are 13 layers of representation
+ # each layer performs differently on different downstream tasks, so choose empirically
+ all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
+ print(all_layer_hidden_states.shape) # [13 layer, 292 timestep, 768 feature_dim]
+
+ # for utterance-level classification tasks, you can simply average the representation over time
+ time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
+ print(time_reduced_hidden_states.shape) # [13, 768]
+
+ # you can even use a learnable weighted average over the layers
+ aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
+ weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
+ print(weighted_avg_hidden_states.shape) # [768]
+ ```
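
Since the table lists 24K Hz as the training sample rate for MERT-v1, audio recorded at another rate may need to be resampled before feature extraction. The following is a minimal sketch of that idea using plain linear interpolation in NumPy. `resample_linear` is a hypothetical helper written for illustration, not part of the m-a-p code, and linear interpolation is a crude stand-in: a dedicated resampler such as `torchaudio.transforms.Resample` would normally be preferable for real audio.

```python
import numpy as np


def resample_linear(waveform, orig_rate, target_rate):
    """Resample a mono waveform by linear interpolation (illustrative only)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if orig_rate == target_rate:
        return waveform
    # Number of output samples that covers the same duration.
    duration = len(waveform) / orig_rate
    num_target = int(round(duration * target_rate))
    # Time stamps (in seconds) of the original and the resampled samples.
    t_orig = np.arange(len(waveform)) / orig_rate
    t_target = np.arange(num_target) / target_rate
    return np.interp(t_target, t_orig, waveform).astype(np.float32)


# One second of a 440 Hz tone at 16 kHz, converted to 24 kHz.
one_second_16k = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
one_second_24k = resample_linear(one_second_16k, 16000, 24000)
print(one_second_24k.shape)  # (24000,)
```

Note that the processor used in the usage example comes from a 16 kHz HuBERT checkpoint, so whether it accepts 24 kHz input is a separate question this sketch does not address.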
+
+
+
+ # Citation
+
+ ```bibtex
+ @article{li2022large,
+   title={Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning},
+   author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
+   year={2022}
+ }
+ ```