
inference example

#1
by eschmidbauer - opened

Hello, thank you for sharing this code!
Do you have example inference code? I'd like to test this model on my own server.

Of course! You'll need to run pip install transformers librosa accelerate wget first.

from transformers import AutoModel
import librosa
import wget

filename = wget.download("https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-1008642825401516622.wav")

speech_data, _ = librosa.load(filename, sr=16_000)

model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)

output = model.generate(audio=speech_data, text_prompt="Respond like a pirate!")

I tested that on a Google Cloud 40GB A100 (same hardware we are hosting the demo on for now), but ymmv on other hardware. I'm just relying on HuggingFace accelerate for most of the distribution across accelerators!

Lmk if you hit snags with that, happy to help change stuff to make it more straightforward!

One step is missing. Before loading the model, add:
import os
os.environ["HF_TOKEN"] = "********"

This is needed because the meta-llama/Meta-Llama-3-8B-Instruct model is gated, and it seems DiVA-llama-3-v0-8b downloads it during the steps above.

Ah, one option I'd recommend is to use huggingface-cli login instead of adding the token to your code! This will persist your access token so you don't need to add it to multiple scripts (and it decreases the risk of accidentally pushing a secret with your code).

https://huggingface.co/docs/huggingface_hub/en/guides/cli
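If you do still want to go through the HF_TOKEN environment variable, here is a minimal sketch of reading it at runtime rather than hard-coding it in the script. The helper name token_from_env is hypothetical, not part of huggingface_hub. As far as I know, when HF_TOKEN is unset the library falls back to the token saved by huggingface-cli login, so returning None here is not an error.

```python
import os


def token_from_env(env=None):
    """Return the HF_TOKEN value from the environment, or None if it is unset.

    `token_from_env` is a hypothetical helper for illustration only.
    The token is set outside the script (e.g. `export HF_TOKEN=...` in the
    shell), so the secret never ends up in version control.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    # An empty or whitespace-only value is treated the same as "not set".
    return token or None


if __name__ == "__main__":
    if token_from_env() is None:
        print("HF_TOKEN not set; relying on the token saved by huggingface-cli login.")
    else:
        print("HF_TOKEN found in the environment.")
```

Either way, the token value itself never needs to appear in the inference script.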

The download of the audio is not working. Is this for cloning a voice? For example, can I use my own audio in English?

Hi Guilherme,

No, DiVA is not currently a text to speech model and we have no plans to support voice cloning. It takes speech as input and replies conversationally with text.

If you are looking for text-to-speech, you may want to look at TTS initiatives like https://github.com/collabora/WhisperSpeech or https://huggingface.co/parler-tts!
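On using your own audio: since the model takes speech as input, a local recording should be able to stand in for the downloaded sample, loaded the same way with librosa.load(path, sr=16_000). Below is a minimal sketch, using only Python's standard wave module, for checking a WAV file's sample rate and channel count beforehand. The helper names describe_wav and needs_resampling are hypothetical and not part of DiVA or librosa, and the wave module only reads uncompressed PCM WAV files.

```python
import wave

# Sample rate used in the inference example above (sr=16_000).
EXPECTED_SR = 16_000


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / sample_rate
    return sample_rate, channels, duration


def needs_resampling(path, target_sr=EXPECTED_SR):
    """True if the file's native sample rate differs from the target rate."""
    sample_rate, _, _ = describe_wav(path)
    return sample_rate != target_sr
```

librosa.load resamples to the requested sr and mixes down to mono by default, so this check is only informational; the loaded array can then be passed as audio= to model.generate, as in the example above.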

@WillHeld
Sorry I have two questions I was wondering if you could answer.

1- Should I quantize it using bitsandbytes (B&B), or does that degrade performance and perplexity by a large margin? I ask because, in my experience, quantization usually doesn't take kindly to models equipped with an encoder.
2- May I also ask you to put together a notebook on fine-tuning this model, if you have the opportunity? Full-parameter fine-tuning if possible, since the model itself isn't too big, but PEFT would also be appreciated, hopefully using the HF Trainer or PyTorch this time around.

I really appreciate it, since this is such interesting work.

Hi!

On 1) I haven't tried any quantization myself so don't have great signal on this! It seems people quantize Whisper, so if I were to guess it's possible without too much degradation but I really don't know.

On 2) You can find all the training code here: https://github.com/Helw150/levanter/tree/will/distill but, as you hinted at, it's all in Jax. Levanter supports LoRA as well for PEFT, so the functionality is all there.

Unfortunately, I don't have it in my roadmap to reproduce the full training stack in PyTorch, since I rely on the TPU Research Cloud for my compute resources. Jax is much better supported there, and models from Levanter are exported to the safetensors format, so they are easily usable from other frameworks at inference time. If you want an out-of-the-box training solution, I'd suggest using Levanter (it supports GPU as well). Here's a doc on how to get set up for audio: https://levanter.readthedocs.io/en/latest/tutorials/Training-On-Audio-Data/

If PyTorch training is a must, you should be able to place the PyTorch conversion of DiVA here into any HuggingFace/PyTorch trainer loop though! Everything in modeling_diva.py is PyTorch and differentiable so will work with standard forward + backward passes.
