|
<!DOCTYPE html> |
|
<html> |
|
<head> |
|
<meta charset="utf-8" /> |
|
<meta name="viewport" content="width=device-width" /> |
|
<title>Riffusion-Melodiff-v1</title> |
|
<link rel="stylesheet" href="style.css" /> |
|
</head> |
|
<body> |
|
<div class="card"> |
|
<h1>Riffusion-Melodiff-v1</h1> |
|
<p><br> Riffusion-Melodiff is a simple but interesting idea (one I have not seen anywhere else) for creating cover versions of songs.</p>
|
<p><br> Riffusion-Melodiff is built on top of the
<a href="https://huggingface.co/riffusion/riffusion-model-v1" target="_blank">Riffusion</a>
model, a Stable Diffusion model fine-tuned to generate Mel spectrograms. (A spectrogram is a
visual representation of audio that shows how its frequency content changes over time.) Riffusion-Melodiff does not contain a new model; there was no new training or fine-tuning.
It uses the same model as Riffusion, only in a different way.</p>
|
<p>Riffusion-Melodiff uses the Img2Img pipeline from the Diffusers library to modify images of Mel spectrograms and produce new versions of music. Just upload your audio
in WAV format (if your audio is in a different format, convert it to WAV first, for example with an online converter). Then run the Img2Img pipeline
with your prompt, seed and strength. The strength parameter decides how closely the modified audio follows the initial audio and how closely it follows the prompt.
When strength is too low, the spectrogram stays too similar to the original and no real modification is produced. When strength is too high, the spectrogram moves too
close to the new prompt, which may lose the melody and/or tempo of the base image. Good values of strength are usually about 0.4-0.5.</p>
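<p>The Img2Img step described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: the function names <code>effective_steps</code> and <code>modify_spectrogram</code> are hypothetical, the default values are illustrative, and it assumes the Riffusion checkpoint loads into the Diffusers <code>StableDiffusionImg2ImgPipeline</code> and that the spectrogram is already available as an RGB PIL image. The first helper reflects how, as far as I understand the Diffusers implementation, strength also scales the number of denoising steps that are actually run.</p>

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps actually run by img2img for a given strength.

    Noise is added only up to `strength`, so only that fraction of the
    schedule is executed (assumption based on the Diffusers img2img pipeline).
    """
    return min(int(num_inference_steps * strength), num_inference_steps)


def modify_spectrogram(init_image, prompt: str, seed: int = 0, strength: float = 0.45):
    """Return a modified spectrogram image for `init_image` (a PIL RGB image)."""
    # Heavy imports stay local so the helper above works without a GPU stack.
    import torch
    from diffusers import StableDiffusionImg2ImgPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "riffusion/riffusion-model-v1"
    ).to(device)

    # A fixed seed makes the result reproducible for a given prompt and strength.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(prompt=prompt, image=init_image, strength=strength, generator=generator)
    return result.images[0]
```

<p>For example, with 50 scheduled steps a strength of 0.5 runs 25 denoising steps and a strength of 0.4 runs 20, so lower strength is also faster.</p>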
|
<p>Good modifications are possible with suitable prompt, seed and strength values. They keep the tempo and melody of the initial audio but
change, for example, the instrument playing that melody. This pipeline also allows modifications of music longer than 5 s. If you cut your audio into 5 s pieces
and use the same prompt, seed and strength for each piece, the generated samples will be somewhat consistent. Concatenating them then gives you a
longer modified audio track.</p>
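<p>The cut-and-concatenate idea can be sketched like this. It is only an illustration on a plain list of mono samples; the helper names <code>split_into_chunks</code> and <code>concatenate</code> are hypothetical, and keeping a shorter final piece (rather than padding it) is an assumption, not something the notebook is known to do.</p>

```python
from typing import List, Sequence


def split_into_chunks(samples: Sequence[float], sample_rate: int,
                      chunk_seconds: float = 5.0) -> List[list]:
    """Cut a mono waveform into consecutive pieces of `chunk_seconds` each.

    The last piece may be shorter (assumption: it is kept, not padded).
    """
    chunk_len = int(sample_rate * chunk_seconds)
    return [list(samples[i:i + chunk_len]) for i in range(0, len(samples), chunk_len)]


def concatenate(chunks: Sequence[Sequence[float]]) -> list:
    """Join processed pieces back into one waveform, in their original order."""
    joined = []
    for chunk in chunks:
        joined.extend(chunk)
    return joined
```

<p>Each piece would be converted to a spectrogram, passed through Img2Img with the same prompt, seed and strength, converted back to audio, and then joined with <code>concatenate</code>.</p>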
|
<p>The quality of the generated music is not amazing (mediocre, I would say), and it needs a bit of prompt and seed engineering. But it shows one way
cover versions of music could be made in the future.</p>
|
<p>
A Colab notebook is included that walks through the process step by step:
<a href="https://huggingface.co/spaces/JanBabela/Riffusion-Melodiff-v1/blob/main/melodiff_v1.ipynb" target="_blank">Melodiff_v1</a>.
</p>
|
<p> <br> Examples of music generated by modifying the underlying song: <br> </p> |
|
<p> |
|
Amazing Grace, originally played by flute, modified to be played by violin |
|
<audio controls> |
|
<source src="Amazing_Grace_flute_i2i_violin.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> |
|
Bella Ciao, originally played by violin, modified to be played by saxophone
|
<audio controls> |
|
<source src="Bella_Cao_violin_i2i_sax.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> |
|
Iko iko, originally played by accordion, modified to be played by saxophone |
|
<audio controls> |
|
<source src="Iko_iko_accordion_i2i_sax.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> |
|
When the Saints, originally played by violin, modified to be sung by vocals
|
<audio controls> |
|
<source src="When_the_Saints_violin_i2i_vocals.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> <br> Examples of longer music samples: <br> </p> |
|
<p> |
|
Iko iko, originally played by accordion, modified to be played by saxophone |
|
<audio controls> |
|
<source src="Iko_iko_long_accordion_i2i_sax.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> |
|
Iko iko, originally played by accordion, modified to be played by violin |
|
<audio controls> |
|
<source src="Iko_iko_long_sax_i2i_violin.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> |
|
When the Saints, originally played by piano, modified to be played by flute |
|
<audio controls> |
|
<source src="When_the_Saints_long_piano_i2i_flute.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</p> |
|
<p> <br> I'm using the standard (free) Google Colab GPU configuration for inference, with the default number of inference steps (23) from the underlying
pipelines. With this setup it takes about 8 s to produce a 5 s modified sample. For a start, that is OK, I would say.</p>
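<p>As a rough back-of-the-envelope sketch, the timing above can be extended to longer audio. The function name <code>estimate_processing_seconds</code> is hypothetical, and the estimate assumes every 5 s piece takes the same 8 s and ignores model loading and audio conversion time.</p>

```python
import math


def estimate_processing_seconds(audio_seconds: float, chunk_seconds: float = 5.0,
                                seconds_per_chunk: float = 8.0) -> float:
    """Estimate total inference time when audio is processed in fixed-size pieces."""
    # A trailing partial piece still costs one full pipeline run.
    chunks = math.ceil(audio_seconds / chunk_seconds)
    return chunks * seconds_per_chunk
```

<p>Under these assumptions, a 60 s song is cut into 12 pieces and takes about 96 s to process.</p>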
|
</div> |
|
</body> |
|
</html> |
|
|