---
base_model:
- stabilityai/stable-audio-open-1.0
language:
- en
tags:
- audio
- music
- music-generation
---

# SAO Instrumental Finetune

## Dataset

The model was trained on a custom synthesized audio dataset. The dataset creation process was the following:

1. **Getting MIDI files:** We started with [The Lakh MIDI Dataset v0.1](https://colinraffel.com/projects/lmd/), specifically its "Clean MIDI subset." We removed duplicate songs and split the MIDI files into individual tracks using [pretty_midi](https://craffel.github.io/pretty-midi/).

2. **Rendering the MIDI files:** Using [pedalboard](https://spotify.github.io/pedalboard/), we loaded VST3 instruments in Python and sent MIDI events to each instrument to generate a WAV file per track. The rendered tracks varied in volume without any musical criterion, so we normalized them using [pyloudnorm](https://www.christiansteinmetz.com/projects-blog/pyloudnorm). To combine all instrument audio files into a single track, we used a simple average across all tracks (a rendering sketch follows the example prompt below).

3. **Extracting metadata:** Each MIDI file is named by song and artist, and from our scripts we already knew the instruments in use. However, to create descriptive prompts, more data was needed. Using the final mix of each MIDI file, we ran [deeprhythm](https://bleu.green/deeprhythm/) to predict the tempo and [essentia's key and scale detector](https://essentia.upf.edu/tutorial_tonal_hpcpkeyscale.html) to identify the key and scale (a metadata-extraction sketch also follows below). Then, with the song title and artist, we used Spotify’s API to fetch additional metadata, such as _energy_, _acousticness_, and _instrumentalness_. We also retrieved a list of "Sections" from Spotify to split songs (usually 2-4 minutes long) into shorter 30-second clips, ensuring musical coherence and avoiding cuts within phrases. Finally, we consulted the LastFM API for user-assigned tags, which typically provided genre information and additional descriptors like "melancholic," "warm," or "gentle," essential for characterizing the mood and style beyond what could be inferred from the audio or MIDI files alone. The final result was a JSON file for each audio, which looked like this:

    ```json
    {
      "Song": "Here Comes The Sun - Remastered 2009",
      "Artist": "The Beatles",
      "Tags": [
        "sunshine pop",
        "rock",
        "60s",
        // ...
      ],
      "Instruments": [
        "Acoustic Guitar",
        "Drums"
      ],
      "duration_ms": 185733,
      "acousticness": 0.9339,
      "energy": 0.54,
      "key": "A",
      "mode": "Major",
      "tempo": 128,
      "sections": [
        { "start": 0.0, "duration": 16.22566 },
        // ...
      ]
    }
    ```

4. **Generating prompts:** With all metadata gathered, we generated prompts. One option would have been to simply concatenate all the data, but that would lead to fixed, overly rigid prompts and detract from the model’s flexibility to understand natural language. To avoid this, we used Llama 3.1 to transform the JSON metadata into more natural, human-like prompts. Both the system and user prompts were dynamically generated with randomized attribute order to avoid any fixed structure, and we provided few-shot examples to improve output quality. Here’s an example of a generated prompt:

"A sunshine pop/rock song of the 60s, driven by acoustic guitar and drums, set in the key of A major, with a lively tempo of 128 BPM."

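The tempo and key/scale extraction from step 3 can be sketched as follows. The file name is a placeholder, and the calls follow the deeprhythm and essentia APIs as documented; treat the exact signatures as assumptions if your installed versions differ.

```python
import essentia.standard as es
from deeprhythm import DeepRhythmPredictor

MIX_PATH = "mix.wav"  # placeholder: the final mix rendered for one song

# Tempo prediction with deeprhythm.
predictor = DeepRhythmPredictor()
tempo = predictor.predict(MIX_PATH)

# Key and scale detection with essentia's KeyExtractor.
audio = es.MonoLoader(filename=MIX_PATH, sampleRate=44100)()
key, scale, strength = es.KeyExtractor()(audio)

metadata = {"tempo": round(tempo), "key": key, "mode": scale.capitalize()}
print(metadata)  # e.g. {"tempo": 128, "key": "A", "mode": "Major"}
```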
Since all prompts contain similar metadata, we recommend always including in the prompt the genre, a few descriptive adjectives, and the instruments you want featured, named according to the [MIDI Instrument Names](https://www.ccarh.org/courses/253/handout/gminstruments/). The model wasn’t trained with every MIDI instrument, so some will perform better than others. You can also specify tempo, key, and scale; it responds well to tempo cues but is less responsive to key and scale adjustments. (A generation sketch using these recommendations follows the dataset table below.)

Using this pipeline, we created two datasets: a monophonic one with ~10 minutes of each instrument playing solo, and a polyphonic one with multiple instruments. After creating 4 hours of audio, we trained our first model to test the training pipeline, check resources, and assess the impact of finetuning. We trained it for 1 hour on an A100 GPU (40 GB) with `batch size = 8`, completing 2,000 steps in 36 epochs. This model is saved as `first_training_test.ckpt`. The training test was successful: even with just 4 hours of training data and 1 hour of training, listening evaluations indicated that the generations adapted to our dataset.

We then generated a few more hours of audio using the same pipeline. However, the synthesized MIDI renders often had an artificial quality due to the nature of the instruments, note playstyles, mixing, etc., and we observed the model starting to reproduce this artificial sound, which we wanted to avoid. To address this, we created a third dataset from non-copyrighted YouTube content available for commercial use, tagging the metadata manually. The final dataset (monophonic + polyphonic + YouTube) is outlined below:
| | Monophonic dataset | Polyphonic dataset | YouTube dataset | Total |
|---|---|---|---|---|
| # audios | 298 | 399 | 326 | 1023 |
| Average duration | 28.4 sec. | 32.4 sec. | 33.6 sec. | - |
| Total duration | 141 min. | 215 min. | 182 min. | 538 min. (9 h) |
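As a usage illustration, here is a minimal generation sketch based on the [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) inference example. The checkpoint filename, the prompt text, and the sampler settings are placeholder assumptions rather than values from this repository, and the state-dict loading is only a sketch of swapping the finetuned weights into the base architecture.

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base Stable Audio Open architecture and config, then swap in the finetuned weights.
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

# "sao_instrumental_finetune.ckpt" is a placeholder filename; depending on how the
# checkpoint was exported, its keys may need remapping (e.g. stripping a training-wrapper prefix).
state = torch.load("sao_instrumental_finetune.ckpt", map_location="cpu")
model.load_state_dict(state.get("state_dict", state), strict=False)
model = model.to(device)

# Prompt built per the recommendations above: genre, adjectives, instruments, tempo.
conditioning = [{
    "prompt": "A melancholic, gentle 60s pop song featuring Acoustic Guitar and Drums at 110 BPM.",
    "seconds_start": 0,
    "seconds_total": 30,
}]

output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# Peak-normalize, convert to int16, and save the generated batch as a stereo WAV.
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
```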
## Training

Using all three datasets, we conducted the main training run, allowing it to run for 5 hours on an A100 GPU with `batch size = 16`. In those 5 hours, we completed 4,000 steps over 63 epochs.

## Results

### Listening test

Our first evaluation was an informal listening comparison between Stable Audio Open (SAO) and our Instrumental Finetune, using new prompts that follow the recommendations stated above. The results were promising: the generated audio better reflected the specified instruments, genre, and overall feel. Here are a few examples across different genres:

#### Jazz

"An upbeat Jazz piece featuring Swing Drums, Trumpet, and Piano. In the key of B Major at 120 BPM, this track brings a lively energy with playful trumpet melodies and syncopated rhythms that invite listeners to tap their feet."

| | Stable Audio Open (SAO) | SAO Instrumental Finetune |
|---|---|---|
| Generated audio |