wanghaofan's Collections: Multi-Modal Model
- What matters when building vision-language models? (arXiv 2405.02246, 101 upvotes)
- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data (arXiv 2406.18790, 33 upvotes)
- Building and better understanding vision-language models: insights and future directions (arXiv 2408.12637, 124 upvotes)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (arXiv 2408.12528, 51 upvotes)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv 2408.11039, 58 upvotes)
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv 2408.08872, 98 upvotes)
- Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion (arXiv 2407.01392, 39 upvotes)
- IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts (arXiv 2408.03209, 21 upvotes)
- Qwen/Qwen2-VL-7B-Instruct (model, Image-Text-to-Text, 1.6M downloads, 1.01k likes)
- THUDM/cogvlm2-llama3-chat-19B (model, Text Generation, 4.64k downloads, 203 likes)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv 2408.15998, 84 upvotes)
- OmniGen: Unified Image Generation (arXiv 2409.11340, 108 upvotes)
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions (arXiv 2409.15278, 24 upvotes)