File size: 6,519 Bytes
e998648 eeaa1ca e998648 2f28ef4 e998648 218fabb e998648 218fabb e998648 218fabb 2f28ef4 e998648 b3c9f6e e998648 00c57f4 e998648 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 |
---
language: en
datasets:
- laion2b
---
# OpenFlamingo-9B (CLIP ViT-L/14, MPT-7B)
[Blog post]() | [Code](https://github.com/mlfoundations/open_flamingo) | [Demo](https://huggingface.co/spaces/openflamingo/OpenFlamingo)
OpenFlamingo is an open source implementation of DeepMind's [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) models.
This 9B-parameter model uses a [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder and [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) language model.
## Model Details
We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we trained this model on a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402) and [Multimodal C4](https://arxiv.org/abs/2304.06939).
This model has cross-attention modules inserted in *every fourth* decoder block. It was trained using DistributedDataParallel across 64 A100 80GB GPUs at automatic BF16 mixed precision.
To use these MPT weights, OpenFlamingo must be initialized using revision `68e1a8e0ebb9b30f3c45c1ef6195980f29063ae2` of the MPT-7B modeling code. We suggest using [this copy of the model](https://huggingface.co/anas-awadalla/mpt-7b) to ensure the code is loaded at that commit.
## Uses
OpenFlamingo models process arbitrarily interleaved sequences of images and text to output text. This allows the models to accept in-context examples and undertake tasks like captioning, visual question answering, and image classification.
### Initialization
``` python
from open_flamingo import create_model_and_transforms
model, image_processor, tokenizer = create_model_and_transforms(
clip_vision_encoder_path="ViT-L-14",
clip_vision_encoder_pretrained="openai",
lang_encoder_path="anas-awadalla/mpt-7b",
tokenizer_path="anas-awadalla/mpt-7b",
cross_attn_every_n_layers=4
)
# grab model checkpoint from huggingface hub
from huggingface_hub import hf_hub_download
import torch
checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B-vitl-mpt7b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```
### Generation example
Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.
``` python
from PIL import Image
import requests
"""
Step 1: Load images
"""
demo_image_one = Image.open(
requests.get(
"http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw
)
demo_image_two = Image.open(
requests.get(
"http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
stream=True
).raw
)
query_image = Image.open(
requests.get(
"http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
stream=True
).raw
)
"""
Step 2: Preprocessing images
Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
batch_size x num_media x num_frames x channels x height x width.
In this case batch_size = 1, num_media = 3, num_frames = 1,
channels = 3, height = 224, width = 224.
"""
vision_x = [image_processor(demo_image_one).unsqueeze(0), image_processor(demo_image_two).unsqueeze(0), image_processor(query_image).unsqueeze(0)]
vision_x = torch.cat(vision_x, dim=0)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)
"""
Step 3: Preprocessing text
Details: In the text we expect an <image> special token to indicate where an image is.
We also expect an <|endofchunk|> special token to indicate the end of the text
portion associated with an image.
"""
tokenizer.padding_side = "left" # For generation padding tokens should be on the left
lang_x = tokenizer(
["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
return_tensors="pt",
)
"""
Step 4: Generate text
"""
generated_text = model.generate(
vision_x=vision_x,
lang_x=lang_x["input_ids"],
attention_mask=lang_x["attention_mask"],
max_new_tokens=20,
num_beams=3,
)
print("Generated text: ", tokenizer.decode(generated_text[0]))
```
### Bias, Risks, and Limitations
OpenFlamingo models inherit the risks of their parent models, especially the language model. As an open-source research effort, we highly value open, accessible, reproducible multimodal model research; however, it is crucial to be aware that these models are trained on web data, have not been finetuned for safety, and thus may produce unintended, inappropriate, unreliable, and/or inaccurate outputs. Please use caution before deploying OpenFlamingo models in real applications. We also hope that OpenFlamingo enables further safety and reliability research to address these issues.
In an effort to mitigate current potential biases and harms, we have deployed a text content filter on model outputs in the OpenFlamingo demo. We continue to red-team the model to understand and improve its safety.
## Evaluation
<table>
<tr>
<th></th>
<th>0-shot</th>
<th>4-shot</th>
<th>8-shot</th>
<th>16-shot</th>
<th>32-shot</th>
</tr>
<tr>
<th>COCO (CIDEr)</th>
<td>79.5 (0.2)</td>
<td>89.0 (0.3)</td>
<td>96.3 (0.1)</td>
<td>98.8 (0.7)</td>
<td>99.5 (0.1)</td>
</tr>
<tr>
<th>VQAv2 (Accuracy)</th>
<td>50.3 (0.7)</td>
<td>50.5 (0.5)</td>
<td>52.8 (0.3)</td>
<td>52.3 (0.3)</td>
<td>50.5 (0.0)</td>
</tr>
<tr>
<th>Flickr-30K (CIDEr)</th>
<td>59.5 (1.0)</td>
<td>65.8 (0.6)</td>
<td>62.9 (1.0)</td>
<td>62.8 (1.0)</td>
<td>61.3 (0.7)</td>
</tr>
<tr>
<th>OK-VQA (Accuracy)</th>
<td>34.7 (0.1)</td>
<td>34.3 (0.1)</td>
<td>38.4 (0.0)</td>
<td>39.5 (0.1)</td>
<td>38.1 (0.0)</td>
</tr>
<tr>
<th>TextVQA (Accuracy)</th>
<td>24.2 (0.5)</td>
<td>28.2 (0.4)</td>
<td>29.1 (0.1)</td>
<td>27.3 (0.1)</td>
<td>23.8 (0.2)</td>
</tr>
<tr>
<th>Vizwiz (Accuracy)</th>
<td>17.7 (0.7)</td>
<td>23.1 (0.9)</td>
<td>31.6 (1.5)</td>
<td>38.0 (1.1)</td>
<td>40.2 (0.7)</td>
</tr>
<tr>
<th>Hateful Memes (ROC AUC)</th>
<td>50.8 (4.7)</td>
<td>47.5 (2.2)</td>
<td>45.2 (2.7)</td>
<td>46.9 (3.8)</td>
<td>52.0 (2.1)</td>
</tr>
</table
|