Irena Gao committed on
Commit 2f28ef4 · 1 Parent(s): e998648

update README

Files changed (1)
  1. README.md +77 -14
README.md CHANGED
@@ -12,11 +12,82 @@ OpenFlamingo is an open source implementation of DeepMind's [Flamingo](https://w
  This 9B-parameter model uses a [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder and [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) language model.

  ## Model Details
- We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we use a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402) and [Multimodal C4](https://arxiv.org/abs/2304.06939).
+ We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we trained this model on a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402) and [Multimodal C4](https://arxiv.org/abs/2304.06939).
+
+ This model has cross-attention modules inserted in *every fourth* decoder block. It was trained using DistributedDataParallel across 64 A100 80GB GPUs with automatic BF16 mixed precision.
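
The training setup named in the added line above (DistributedDataParallel with automatic BF16 mixed precision) can be pictured with a generic PyTorch sketch. This is not OpenFlamingo's training code: the model, data, optimizer, and hyperparameters below are illustrative placeholders, and only the DDP wrapper and the BF16 autocast context correspond to the stated configuration.

``` python
# Generic sketch only: DDP training with automatic BF16 mixed precision.
# The model, data, and loss are placeholders, not OpenFlamingo internals.
# Launch with: torchrun --nproc_per_node=<gpus_per_node> train_sketch.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")  # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder for the trainable connector modules (the rest of the network stays frozen).
    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(100):  # placeholder loop over interleaved image-text batches
        batch = torch.randn(8, 1024, device=local_rank)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # automatic BF16 mixed precision
            loss = model(batch).float().pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under autocast, forward ops run in bfloat16 while the parameters themselves stay in FP32, which is what "automatic mixed precision" refers to here.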
 
  ## Uses
  OpenFlamingo models process arbitrarily interleaved sequences of images and text to output text. This allows the models to accept in-context examples and undertake tasks like captioning, visual question answering, and image classification.

+ ### Generation example
+ Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.
+
+ ``` python
+ from PIL import Image
+ import requests
+ import torch
+
+ # `model`, `image_processor`, and `tokenizer` are assumed to be the OpenFlamingo model,
+ # image processor, and tokenizer created at initialization (see the sketch after this block).
+
+ """
+ Step 1: Load images
+ """
+ demo_image_one = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
+     ).raw
+ )
+
+ demo_image_two = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
+         stream=True
+     ).raw
+ )
+
+ query_image = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
+         stream=True
+     ).raw
+ )
+
+
+ """
+ Step 2: Preprocessing images
+ Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
+ batch_size x num_media x num_frames x channels x height x width.
+ In this case batch_size = 1, num_media = 3, num_frames = 1,
+ channels = 3, height = 224, width = 224.
+ """
+ vision_x = [
+     image_processor(demo_image_one).unsqueeze(0),
+     image_processor(demo_image_two).unsqueeze(0),
+     image_processor(query_image).unsqueeze(0),
+ ]
+ vision_x = torch.cat(vision_x, dim=0)
+ vision_x = vision_x.unsqueeze(1).unsqueeze(0)
+
+ """
+ Step 3: Preprocessing text
+ Details: In the text we expect an <image> special token to indicate where an image is.
+ We also expect an <|endofchunk|> special token to indicate the end of the text
+ portion associated with an image.
+ """
+ tokenizer.padding_side = "left"  # For generation, padding tokens should be on the left
+ lang_x = tokenizer(
+     ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
+     return_tensors="pt",
+ )
+
+
+ """
+ Step 4: Generate text
+ """
+ generated_text = model.generate(
+     vision_x=vision_x,
+     lang_x=lang_x["input_ids"],
+     attention_mask=lang_x["attention_mask"],
+     max_new_tokens=20,
+     num_beams=3,
+ )
+
+ print("Generated text: ", tokenizer.decode(generated_text[0]))
+ ```
+
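The example above uses `model`, `image_processor`, and `tokenizer` without creating them. A minimal initialization sketch follows, assuming the `open_flamingo` package is installed; the `lang_encoder_path`/`tokenizer_path` values, the checkpoint repo id, and the `checkpoint.pt` filename are placeholders to replace with the values documented for this release.

``` python
# Minimal initialization sketch (assumes the open_flamingo package is installed).
# The lang_encoder_path, tokenizer_path, repo_id, and checkpoint filename are
# placeholders -- substitute the ones documented for this release.
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",   # CLIP ViT-L/14 vision encoder
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="mosaicml/mpt-7b",   # placeholder: MPT-7B language model
    tokenizer_path="mosaicml/mpt-7b",      # placeholder: its tokenizer
    cross_attn_every_n_layers=4,           # cross-attention in every fourth decoder block
)

# Download and load the trained connector weights (repo_id is a placeholder).
repo_id = "<this-model's-Hugging-Face-repo-id>"
checkpoint_path = hf_hub_download(repo_id, "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```

With these three objects in place, the generation example above should run as written.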
  ### Bias, Risks, and Limitations
  OpenFlamingo models inherit the risks of their parent models, especially the language model. As an open-source research effort, we highly value open, accessible, reproducible multimodal model research; however, it is crucial to be aware that these models are trained on web data, have not been finetuned for safety, and thus may produce unintended, inappropriate, unreliable, and/or inaccurate outputs. Please use caution before deploying OpenFlamingo models in real applications. We also hope that OpenFlamingo enables further safety and reliability research to address these issues.

@@ -42,11 +113,11 @@ In an effort to mitigate current potential biases and harms, we have deployed a
  </tr>
  <tr>
  <th>VQAv2 (Accuracy)</th>
- <td>48.3 (0.1)</td>
- <td>49.4 (0.4)</td>
- <td>51.8 (0.4)</td>
- <td>51.3 (0.5)</td>
- <td>50.2 (0.6)</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
  </tr>
  <tr>
  <th>Flickr-30K (CIDEr)</th>
 
@@ -80,14 +151,6 @@ In an effort to mitigate current potential biases and harms, we have deployed a
  <td>38.0 (1.1)</td>
  <td>40.2 (0.7)</td>
  </tr>
- <tr>
- <th>ImageNet (Top-1 Accuracy)</th>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- </tr>
  <tr>
  <th>Hateful Memes (ROC AUC)</th>
  <td>-</td>