language: en
tags:
- multimodal
- text
- image
license: other
datasets:
- HuggingFaceM4/OBELISC
- wikipedia
- facebook/pmd
- laion/laion2B-en
TODO: logo?
Model Card for m4-80b
ATUM (Adapted Transformers for Unstructured Multimodal data) is an open-access reproduction of Flamingo, a closed-source visual language model developed by DeepMind. The multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs. It is built solely on publicly available data and models. ATUM (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification, when evaluated with in-context few-shot learning.
The model comes in two variants: a large 80-billion-parameter version and a 9-billion-parameter version. We also fine-tune these base models on a mixture of SFT datasets (TODO: find a more understandable characterization), which boosts downstream performance while making the models more usable in conversational settings: (TODO: 80B-sfted) and (TODO: 9B sfted).
Table of Contents
- Model Card for m4-80b
- Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Model Examination
- Environmental Impact
- Technical Specifications [optional]
- Citation
- Glossary [optional]
- More Information [optional]
- Model Card Authors [optional]
- Model Card Contact
- How to Get Started with the Model
Model Details
- Developed by: Hugging Face
- Model type: Multi-modal model (text+image)
- Language(s) (NLP): en
- License: other
- Parent Model: laion/CLIP-ViT-H-14-laion2B-s32B-b79K and huggyllama/llama-65b
- Resources for more information:
ATUM is a large multimodal model that takes sequences of interleaved images and texts as inputs and generates text outputs. The model shows strong in-context few-shot learning capabilities (on par with the closed-source model) and is a robust starting point for fine-tuning multimodal models on custom data.
ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents.
Uses
The model can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query/instruction along with one or multiple images. This model does not support image generation.
It is possible to fine-tune the base model on custom data for a specific use case. We note that the instruction-fine-tuned models are significantly better at following instructions and thus should be preferred when using the models out of the box.
The following screenshot is an example of interaction with the model:
TODO: screenshot
How to Get Started with the Model
Use the code below to get started with the model.
More information needed
Training Details
We closely follow the training procedure laid out in Flamingo. We combine two open-source pre-trained models (laion/CLIP-ViT-H-14-laion2B-s32B-b79K and huggyllama/llama-65b) by initializing new Transformer blocks.
The model is trained on the following mixture of openly accessible data:

| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
|-------------|--------------|----------------------------|----------------------------|--------|------------------------------------------|
| PMD | Image-Text Pairs | TODO | TODO | 3 | 73.85% |
| LAION | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
| OBELISC | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 2.82% |
| Wikipedia | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 17.18% |

For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks. The training objective is the standard next token prediction.
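As a rough illustration of what packing images with their captions could look like, here is a minimal sketch. It is not the actual training code: the greedy strategy, the `pack_pairs` helper, and the `<image>` and `</s>` placeholder tokens are all assumptions made for this example, and captions are represented as pre-split token lists rather than real tokenizer output.

```python
from typing import Dict, List, Tuple

IMAGE_TOKEN = "<image>"  # hypothetical placeholder marking where an image sits in the text
EOS_TOKEN = "</s>"       # hypothetical separator between packed examples


def pack_pairs(pairs: List[Tuple[str, List[str]]], max_len: int) -> List[Dict[str, list]]:
    """Greedily pack (image_id, caption_tokens) pairs into sequences of at most max_len tokens."""
    sequences = []
    tokens: List[str] = []
    images: List[str] = []
    for image_id, caption in pairs:
        example = [IMAGE_TOKEN] + caption + [EOS_TOKEN]
        if len(example) > max_len:
            # Truncate an overly long caption, keeping the image placeholder and separator.
            example = example[: max_len - 1] + [EOS_TOKEN]
        if tokens and len(tokens) + len(example) > max_len:
            # The current sequence is full: emit it and start a new one.
            sequences.append({"tokens": tokens, "images": images})
            tokens, images = [], []
        tokens = tokens + example
        images = images + [image_id]
    if tokens:
        sequences.append({"tokens": tokens, "images": images})
    return sequences


pairs = [
    ("img_0", ["a", "dog", "on", "grass"]),
    ("img_1", ["two", "cats"]),
    ("img_2", ["a", "red", "car"]),
]
packed = pack_pairs(pairs, max_len=10)
for seq in packed:
    print(len(seq["tokens"]), seq["images"])
```

Each packed sequence keeps the list of images aligned with the order of the image placeholders, so that a vision encoder could later supply the hidden states that the cross-attention blocks attend to at those positions.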
Evaluation
We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning. We compare our model to the original Flamingo along with OpenFlamingo, another open-source reproduction.
TODO: beautiful plots of shots scaling laws.
TODO: detail of the numbers in a table.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). As a derivative of such a language model, ATUM can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Moreover, ATUM can produce factually incorrect texts, and should not be relied on to produce factually accurate information.
Here are a few examples of outputs that could be categorized as factually incorrect, biased, or offensive: TODO: give 4/5 representative examples
To measure ATUM's ability to recognize sociological (TODO: find a better adjective) attributes, we evaluate the model on FairFace... TODO: include FairFace numbers
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 64 nodes of 8x 80GB A100 GPUs, EFA network
- Hours used: ~672 node hours
- Cloud Provider: AWS SageMaker
- Carbon Emitted: unknown
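Since the carbon figure is not yet known, the sketch below shows one way it could be estimated with a simple power × time × carbon-intensity calculation, in the spirit of the calculator cited above. It reads "~672 node hours" literally; if that figure instead denotes wall-clock hours across all 64 nodes, the GPU-hour total would be 64 times larger. The helper names, the power draw, the PUE, and the grid intensity are placeholder assumptions for illustration, not measured values for this training run.

```python
def gpu_hours(node_hours: float, gpus_per_node: int) -> float:
    """Convert node-hours into GPU-hours."""
    return node_hours * gpus_per_node


def estimate_co2_kg(
    gpu_hours_used: float,
    gpu_power_kw: float,
    pue: float,
    grid_kg_co2_per_kwh: float,
) -> float:
    """Energy (kWh) = GPU-hours x power per GPU x PUE; emissions = energy x grid intensity."""
    energy_kwh = gpu_hours_used * gpu_power_kw * pue
    return energy_kwh * grid_kg_co2_per_kwh


# From the card, read literally: ~672 node-hours on nodes with 8 GPUs each.
total_gpu_hours = gpu_hours(672, 8)

# Placeholder inputs (assumptions): replace with the real power draw,
# data-center PUE, and regional carbon intensity before reporting a number.
estimate = estimate_co2_kg(
    total_gpu_hours,
    gpu_power_kw=0.4,
    pue=1.1,
    grid_kg_co2_per_kwh=0.4,
)
print(f"{total_gpu_hours:.0f} GPU-hours, roughly {estimate:.0f} kg CO2eq under these assumptions")
```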
Technical Specifications
Hardware
The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.
As the network is quite slow for the needs of DeepSpeed ZeRO-3, we were only able to clock ~90 TFLOPs.
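To put the throughput figure in context, the following back-of-the-envelope calculation compares it to the hardware peak. It assumes, which the card does not state, that ~90 TFLOPs is a per-GPU figure and that training ran in BF16, for which NVIDIA lists a dense peak of 312 TFLOPs on the A100.

```python
# Assumptions (not stated in the card): ~90 TFLOPs is per GPU, and training
# used BF16, whose dense peak on an A100 is 312 TFLOPs.
A100_PEAK_BF16_TFLOPS = 312.0

NODES = 64
GPUS_PER_NODE = 8
ACHIEVED_TFLOPS_PER_GPU = 90.0

total_gpus = NODES * GPUS_PER_NODE
utilization = ACHIEVED_TFLOPS_PER_GPU / A100_PEAK_BF16_TFLOPS
aggregate_pflops = total_gpus * ACHIEVED_TFLOPS_PER_GPU / 1000.0

print(f"GPUs: {total_gpus}")
print(f"Fraction of peak: {utilization:.1%}")
print(f"Aggregate throughput: {aggregate_pflops:.2f} PFLOPs")
```

Under these assumptions the run reaches a bit under a third of the theoretical peak, which is consistent with the remark that the interconnect limits ZeRO-3.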
Software
The training software is built on top of Hugging Face Transformers and Accelerate, with DeepSpeed ZeRO-3 for training and WebDataset for data loading.
Citation
BibTeX:
More information needed
APA:
More information needed
Model Card Authors [optional]
Victor, Stas, XXX
Model Card Contact
Please open a discussion on the Community tab!