Model Overview

Description:

Audio Flamingo is a novel audio-understanding language model for

  • understanding audio,
  • quickly adapting to unseen tasks via in-context learning and retrieval, and
  • understanding and responding to multi-turn dialogues.

We introduce a series of training techniques, architecture designs, and data strategies to give the model these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, which sets new state-of-the-art results.

This model is for non-commercial research use only.

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: Audio Flamingo

Audio Flamingo is a Flamingo-style architecture with a frozen audio feature extractor, trainable transformation layers and xattn-dense (cross-attention dense) layers, and language model layers.
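
The component layout described above can be summarized in a small sketch. This is a plain-Python illustration, not the released implementation: the `Component` class, the `AUDIO_FLAMINGO_COMPONENTS` tuple, and the `names_where` helper are hypothetical names, and the trainability of the language model layers is recorded as `None` because the description does not state it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    """One architectural block and whether its weights are updated in training."""
    name: str
    trainable: Optional[bool]  # None means "not stated in the model card"


# Order follows the architecture description in the model card.
AUDIO_FLAMINGO_COMPONENTS = (
    Component("audio feature extractor", trainable=False),  # stated as frozen
    Component("transformation layers", trainable=True),     # stated as trainable
    Component("xattn-dense layers", trainable=True),        # stated as trainable
    Component("language model layers", trainable=None),     # not stated
)


def names_where(trainable: Optional[bool]) -> List[str]:
    """Return the names of components with the given trainability flag."""
    return [c.name for c in AUDIO_FLAMINGO_COMPONENTS if c.trainable is trainable]


if __name__ == "__main__":
    print("frozen:   ", names_where(False))
    print("trainable:", names_where(True))
    print("unstated: ", names_where(None))
```

In a PyTorch training script, a frozen component would typically have gradients disabled on its parameters (for example with `module.requires_grad_(False)`), while the trainable ones are passed to the optimizer.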

Input:

Input Types: Audio, Text
Input Format: Wav/MP3/Flac, String
Input Parameters: None
Maximum Audio Input Length: 33.25 seconds
Maximum Text Input Length: 512 tokens
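
The input limits above can be enforced before inference. The following is a minimal sketch, assuming that over-length inputs should simply be truncated (the card does not say whether the model truncates or rejects them) and that audio and text are already available as Python sequences of samples and token ids. The helper names and the 16 kHz sample rate in the usage lines are illustrative assumptions, not values taken from the card.

```python
from typing import Sequence

MAX_AUDIO_SECONDS = 33.25  # from the model card
MAX_TEXT_TOKENS = 512      # from the model card


def max_audio_samples(sample_rate_hz: int) -> int:
    """Number of audio samples that fit in the maximum audio input length."""
    return int(MAX_AUDIO_SECONDS * sample_rate_hz)


def clip_audio(samples: Sequence, sample_rate_hz: int) -> Sequence:
    """Keep only the leading samples that fit in the audio limit."""
    return samples[: max_audio_samples(sample_rate_hz)]


def clip_tokens(token_ids: Sequence) -> Sequence:
    """Keep only the leading tokens that fit in the text limit."""
    return token_ids[:MAX_TEXT_TOKENS]


if __name__ == "__main__":
    sample_rate_hz = 16_000               # assumed rate, for illustration only
    audio = [0.0] * (40 * sample_rate_hz)  # 40 seconds of silence
    tokens = list(range(600))
    print(len(clip_audio(audio, sample_rate_hz)) / sample_rate_hz, "seconds kept")
    print(len(clip_tokens(tokens)), "tokens kept")
```

Truncating from the end is only one policy; splitting long audio into several segments is another option, depending on the task.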

Output:

Output Type: Text
Output Format: String
Output Parameters: None

Software Integration:

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper
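
A quick way to check whether a GPU falls under the listed microarchitectures is to look at its CUDA compute capability. The sketch below is a hedged illustration: the mapping covers the compute capabilities NVIDIA assigns to Ampere (8.0, 8.6, 8.7) and Hopper (9.0), and the `listed_architecture` helper is a hypothetical name. Other GPUs may still work; they are just not listed in this card.

```python
from typing import Optional

# Compute capability (major, minor) -> microarchitecture listed in the model card.
# Note that 8.9 is Ada Lovelace, not Ampere, so it is deliberately absent.
LISTED_ARCHITECTURES = {
    (8, 0): "NVIDIA Ampere",
    (8, 6): "NVIDIA Ampere",
    (8, 7): "NVIDIA Ampere",
    (9, 0): "NVIDIA Hopper",
}


def listed_architecture(major: int, minor: int) -> Optional[str]:
    """Return the listed microarchitecture name, or None if the GPU is not listed."""
    return LISTED_ARCHITECTURES.get((major, minor))


if __name__ == "__main__":
    # With PyTorch installed, the pair can be read from the current device via
    # torch.cuda.get_device_capability(); here it is hard-coded for an A100.
    major, minor = 8, 0
    print(listed_architecture(major, minor) or "not listed in the model card")
```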

Preferred/Supported Operating System(s):

  • Linux

Model Version(s):

  • v1.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

Audio Flamingo is trained on publicly available datasets under various licenses, the most restrictive of which are non-commercial/research-only. The training data contains diverse audio types, including speech, environmental sounds, and music.

For all of these datasets, the data collection method is [human]. For OpenAQA, Laion630k, LP-MusicCaps, WavCaps, MusicQA, the data labeling method is [synthetic]. For the rest, the data labeling method is [human].

Evaluation Dataset:

Audio Flamingo is evaluated on the test split of the following datasets.

For all of these datasets, the data collection method is [human] and the data labeling method is [human].

Inference

Engine: HuggingFace Transformers
Test Hardware: A100 80GB