- NVLM: Open Frontier-Class Multimodal LLMs (Paper • 2409.11402 • Published • 73)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 18)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Paper • 2403.18814 • Published • 45)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Paper • 2409.17146 • Published • 106)
Collections including paper arxiv:2409.12191
- iVideoGPT: Interactive VideoGPTs are Scalable World Models (Paper • 2405.15223 • Published • 12)
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (Paper • 2405.15574 • Published • 53)
- An Introduction to Vision-Language Modeling (Paper • 2405.17247 • Published • 87)
- Matryoshka Multimodal Models (Paper • 2405.17430 • Published • 31)
- The Llama 3 Herd of Models (Paper • 2407.21783 • Published • 110)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Paper • 2409.12191 • Published • 76)
- Baichuan Alignment Technical Report (Paper • 2410.14940 • Published • 50)
- A Survey of Small Language Models (Paper • 2410.20011 • Published • 40)
- Qwen2.5-Coder Technical Report (Paper • 2409.12186 • Published • 139)
- Attention Heads of Large Language Models: A Survey (Paper • 2409.03752 • Published • 89)
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency (Paper • 2409.02634 • Published • 90)
- OmniGen: Unified Image Generation (Paper • 2409.11340 • Published • 109)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Paper • 2409.17146 • Published • 106)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Paper • 2409.12191 • Published • 76)
- mistralai/Pixtral-12B-2409 (Image-Text-to-Text • Updated • 562)
- HuggingFaceTB/SmolVLM-Instruct (Image-Text-to-Text • Updated • 54.9k • 310)
- Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection (Paper • 2409.08513 • Published • 12)
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (Paper • 2409.08264 • Published • 43)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Paper • 2409.12191 • Published • 76)
- LLMs + Persona-Plug = Personalized LLMs (Paper • 2409.11901 • Published • 32)
- An Introduction to Vision-Language Modeling (Paper • 2405.17247 • Published • 87)
- Visual Instruction Tuning (Paper • 2304.08485 • Published • 13)
- Improved Baselines with Visual Instruction Tuning (Paper • 2310.03744 • Published • 37)
- PALO: A Polyglot Large Multimodal Model for 5B People (Paper • 2402.14818 • Published • 23)