update readme
README.md
CHANGED
@@ -7,9 +7,7 @@ pipeline_tag: image-text-to-text
 
 
 # Model description
-
-
-`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
+`xGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
 
 In the v1.5 (08/2024) release, we present a series of XGen-MM models including:
 - [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
@@ -23,7 +21,7 @@ In addition to the models, we are also releasing a series of datasets for multi-
 - [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
 - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
 
-For more details, check out our [tech report]() and project page (coming soon).
+For more details, check out our [tech report](https://arxiv.org/pdf/2408.08872) and project page (coming soon).
 
 # Data
 The base model is pre-trained on a mixture of data sources described above, with around 100 billion image-text tokens in total.
@@ -61,7 +59,7 @@ Below are some qualitative examples below of the mutli-modal in-context learning
 
 # How to use
 
-Please check out our [inference notebook](demo.ipynb) for example code to use our model.
+Please check out our [inference notebook](demo.ipynb) for example code to use our model.
 
 # Reproducibility:
 
@@ -77,7 +75,7 @@ We strongly recommend users assess safety and fairness before applying to downst
 
 # License
 
-Our code and weights are released under the
+Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.
 
 # Code acknowledgement
 Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
@@ -88,15 +86,16 @@ We thank the authors for their open-source implementations.
 
 # Citation
 ```
-@
-
-
-
-
-
+@article{blip3-xgenmm,
+author = {Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu},
+title = {xGen-MM(BLIP-3): A Family of Open Large Multimodal Models},
+journal = {arXiv preprint},
+month = {August},
+year = {2024},
 }
 ```
 
+
 # Troubleshoot
 
 1. If you missed any packages, please consider the following