---
pipeline_tag: text-to-audio
library_name: audiocraft
language: en
tags:
- text-to-audio
- musicgen
- songstarter
license: cc-by-nc-4.0
---
# Model Card for musicgen-songstarter-v0.2
musicgen-songstarter-v0.2 is a `musicgen-stereo-melody-large` model fine-tuned on a dataset of melody loops from my Splice sample library. It is intended to generate song ideas that are useful to music producers. It generates stereo audio at 32 kHz.
**👀 Update:** I wrote a blog post detailing how and why I trained this model, including training details, the dataset, Weights and Biases logs, etc.
Compared to `musicgen-songstarter-v0.1`, this new version:
- Was trained on 3x more unique, manually curated samples that I painstakingly purchased on Splice
- Is twice the size, bumped up from a `medium` ➡️ `large` transformer LM
If you find this model interesting, please consider:
- following me on GitHub
- following me on Twitter
## Usage
Install audiocraft:
```
pip install -U git+
```
Then, you should be able to load this model just like any other musicgen checkpoint here on the Hub:
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('nateraw/musicgen-songstarter-v0.2')
model.set_generation_params(duration=8)  # generate 8 seconds.

wav = model.generate_unconditional(4)  # generates 4 unconditional audio samples

descriptions = ['acoustic, guitar, melody, trap, d minor, 90 bpm'] * 3
wav = model.generate(descriptions)  # generates 3 samples.

melody, sr = torchaudio.load('./assets/bach.mp3')
# generates using the melody from the given audio and the provided descriptions.
# melody[None].expand(3, -1, -1) repeats the same melody once per description.
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
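
If you want to sanity-check what comes back from the generate calls, here is a minimal sketch of the tensor size to expect. It assumes the output is laid out as `[batch, channels, samples]` and uses the 32 kHz stereo figures from this card; `expected_output_shape` is a hypothetical helper, not part of audiocraft.

```python
def expected_output_shape(batch_size, duration_s, sample_rate=32_000, channels=2):
    """Return the (batch, channels, samples) shape expected for generated audio.

    Assumes stereo output at 32 kHz, as described in this model card.
    """
    return (batch_size, channels, int(duration_s * sample_rate))


# Three text prompts, 8 seconds each.
print(expected_output_shape(3, 8))  # (3, 2, 256000)
```

Comparing this against `wav.shape` is a quick way to confirm the duration setting took effect before writing files.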
## Prompt Format
Use the following prompt format:
```
{tag_1}, {tag_2}, ..., {tag_n}, {key}, {bpm} bpm
```
For example:
```
hip hop, soul, piano, chords, jazz, neo jazz, G# minor, 140 bpm
```
For some example tags, see the prompt format section of musicgen-songstarter-v0.1's readme. The tags there are for the smaller v0.1 dataset, but should give you an idea of what the model saw.
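
If you build prompts programmatically, a small helper keeps them in this format. This is only a sketch: `build_prompt` is a hypothetical function, not part of audiocraft, and the validation rules (at least one tag, positive BPM) are my own assumptions.

```python
def build_prompt(tags, key, bpm):
    """Format tags, key, and tempo as '{tag_1}, ..., {tag_n}, {key}, {bpm} bpm'."""
    # Drop empty tags and surrounding whitespace.
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    if not cleaned:
        raise ValueError("at least one tag is required")
    bpm = int(bpm)
    if bpm <= 0:
        raise ValueError("bpm must be a positive integer")
    return ", ".join(cleaned + [key.strip(), f"{bpm} bpm"])


prompt = build_prompt(
    ["hip hop", "soul", "piano", "chords", "jazz", "neo jazz"],
    key="G# minor",
    bpm=140,
)
print(prompt)  # hip hop, soul, piano, chords, jazz, neo jazz, G# minor, 140 bpm
```

The resulting string can be passed in the `descriptions` list given to `model.generate`.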
## Samples
<table style="width:100%; text-align:center;">
  <tr>
    <th>Audio Prompt</th>
    <th>Text Prompt</th>
    <th>Output</th>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td>trap, synthesizer, songstarters, dark, G# minor, 140 bpm</td>
    <td>
      <audio controls>
        <source src="" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
  <tr>
    <td>
      <audio controls>
        <source src="" type="audio/mp3">
        Your browser does not support the audio element.
      </audio>
    </td>
    <td>acoustic, guitar, melody, trap, D minor, 90 bpm</td>
    <td>
      <audio controls>
        <source src="" type="audio/wav">
        Your browser does not support the audio element.
      </audio>
    </td>
  </tr>
</table>
## Training Details
For more verbose details, you can check out the blog post.
- **code**:
  - The repo is an undocumented fork of facebookresearch/audiocraft in which I rewrote the training loop with PyTorch Lightning, which worked a bit better for me.
- **data**:
  - Around 1700-1800 samples that I manually listened to and purchased via my personal Splice account, totaling about 7-8 hours of audio.
  - Given the licensing terms, I cannot share the data.
- **hardware**:
  - 8xA100 40GB instance from Lambda Labs
- **procedure**:
  - Trained for 10k steps, which took about 6 hours
  - Reduced segment duration at train time to 15 seconds
- **hparams/logs**:
  - See the wandb run, which includes training metrics, logs, hardware metrics at train time, hyperparameters, and the exact command I used when I ran the training script.
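
As a rough sanity check on the figures above, the back-of-the-envelope calculation below derives the average clip length and the average time per training step. The inputs are the approximate numbers quoted in this card, and it assumes "steps" means optimizer steps, so treat the results as estimates rather than measurements.

```python
# Approximate figures quoted in the training details.
samples_low, samples_high = 1700, 1800
hours_low, hours_high = 7, 8
train_steps = 10_000
train_hours = 6

# Average clip length: smallest when the fewest hours are spread over the most clips.
avg_clip_min_s = hours_low * 3600 / samples_high
avg_clip_max_s = hours_high * 3600 / samples_low

# Average wall-clock time per training step.
sec_per_step = train_hours * 3600 / train_steps

print(f"average clip length: {avg_clip_min_s:.1f}-{avg_clip_max_s:.1f} s")
print(f"time per step: {sec_per_step:.2f} s")
```

The average clip comes out at roughly 14 to 17 seconds, which is in the same range as the 15-second training segment duration.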
## Acknowledgements
This work would not have been possible without:
- Lambda Labs, for subsidizing larger training runs by providing some compute credits
- Replicate, for early development compute resources
Thank you ❤️