Leyo committed on
Commit 62de87a · 1 Parent(s): e23b75c

Switch to IDEFICS/ OBELICS

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - image
  license: other
  datasets:
- - HuggingFaceM4/OBELISC
+ - HuggingFaceM4/OBELICS
  - wikipedia
  - facebook/pmd
  - laion/laion2B-en
@@ -18,8 +18,8 @@ TODO: logo?
  # Model Card for m4-80b

  <!-- Provide a quick summary of what the model is/does. [Optional] -->
- ATUM (**A**dapted **T**ransformers for **U**nstructured **M**ultimodal data) is an open-access reproduction of Flamingo, a closed-source visual language model developed by Deepmind. The multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs and is built solely on public available data and models.
- ATUM (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning.
+ IDEFICS (**I**mage-aware **D**ecoder **E**nhanced à la **F**lamingo with **I**nterleaved **C**ross-attention**S**) is an open-access reproduction of Flamingo, a closed-source visual language model developed by Deepmind. The multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs and is built solely on public available data and models.
+ IDEFICS (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning.

  The model comes into two variants: a large [80 billion parameters version](https://huggingface.co/HuggingFaceM4/m4-80b) and a [9 billion parameters version](https://huggingface.co/HuggingFaceM4/m4-9b).
  We also fine-tune these base models on a mixture of SFT datasets (TODO: find a more understandable characterization), which boosts the downstream performance while making the models more usable in conversational settings: (TODO: 80B-sfted) and (TODO: 9B sfted).
@@ -72,14 +72,14 @@ We also fine-tune these base models on a mixture of SFT datasets (TODO: find a m
  - **Parent Model:** [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
  - **Resources for more information:**
  - [GitHub Repo](https://github.com/huggingface/m4/)
- - Description of [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC): [OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
+ - Description of [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS): [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
  ](https://huggingface.co/papers/2306.16527)
  - Original Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://huggingface.co/papers/2204.14198)

- ATUM is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs.
+ IDEFICS is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs.
  The model shows strong in-context few-shot learning capabilities (and on par with the closed-source model), and is a robust starting point to fine-tune multimodal models on custom data.

- ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstrucutred multimodal web documents.
+ IDEFICS is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstrucutred multimodal web documents.


  # Uses
@@ -117,12 +117,12 @@ The model is trained on the following data mixture of openly accessible English

  | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
  |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
- | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
+ | [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
  | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | 3.192B | 39M | 3 | 6.15% |
  | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B | 1.120B | 1 | 17.18%
  | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | 1.6B | 70M | 3 | 2.82% | |

- **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
+ **OBELICS** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).

  **Wkipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.

@@ -137,7 +137,7 @@ Following (Dehghani et al., 2023)[https://huggingface.co/papers/2302.05442], we
  The training objective is the standard next token prediction.

  We use the following hyper and training parameters:
- | Parameters | | ATUM | ATUM-9b |
+ | Parameters | | IDEFICS | IDEFICS-9b |
  | -- | -- | -- | -- |
  | Perceiver Resampler | Number of Layers | 6 | 6 |
  | | Number of Latents | 64 | 64 |
@@ -165,11 +165,11 @@ We use the following hyper and training parameters:
  # Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->
- We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.
+ We closely follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.

  We compare our model to the original Flamingo along with [OpenFlamingo](openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.

- We perform checkpoint selection based on validation sets of TODO, and select the checkpoint at step 65'000 for ATUM-9B and at step 37'500 for ATUM. The models are evaluated with in-context few-shot learning where the priming instances are selected from a support set to be similar (i.e. close in a vector space) to the queried instance. We do not use any form of ensembling.
+ We perform checkpoint selection based on validation sets of TODO, and select the checkpoint at step 65'000 for IDEFICS-9B and at step 37'500 for IDEFICS. The models are evaluated with in-context few-shot learning where the priming instances are selected from a support set to be similar (i.e. close in a vector space) to the queried instance. We do not use any form of ensembling.

  TODO: beautiful plots of shots scaling laws.

@@ -205,13 +205,13 @@ The training software is built on top of HuggingFace Transformers + Accelerate,
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

  Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
- As a derivative of such a language model, ATUM can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
- Moreover, ATUM can produce factually incorrect texts, and should not be relied on to produce factually accurate information.
+ As a derivative of such a language model, IDEFICS can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+ Moreover, IDEFICS can produce factually incorrect texts, and should not be relied on to produce factually accurate information.

  Here are a few examples of outputs that could be categorized as factually incorrect, biased, or offensive:
  TODO: give 4/5 representative examples

- To measure ATUM's ability to recognize socilogical (TODO: find a better adjective) attributes, we evaluate the model on FairFace...
+ To measure IDEFICS's ability to recognize socilogical (TODO: find a better adjective) attributes, we evaluate the model on FairFace...
  TODO: include FairFace numbers

