Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published Dec 13, 2024 • 136
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 77
Contrastive Localized Language-Image Pre-Training Paper • 2410.02746 • Published Oct 3, 2024 • 33
LLaVA-Critic: Learning to Evaluate Multimodal Models Paper • 2410.02712 • Published Oct 3, 2024 • 35
LLaVA-Video Collection • Models focused on video understanding (previously known as LLaVA-NeXT-Video) • 6 items • Updated Oct 5, 2024 • 56
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17, 2024 • 33
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models Paper • 2402.07865 • Published Feb 12, 2024 • 12
Aligning Large Multimodal Models with Factually Augmented RLHF Paper • 2309.14525 • Published Sep 25, 2023 • 30
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models Paper • 2308.01390 • Published Aug 2, 2023 • 33
DIALGEN: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations Paper • 2307.07047 • Published Jul 13, 2023 • 15