merve's Collections
MIT Talk 31/10 Papers
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 72
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 18
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 45
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 104
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 40
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 83
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 75
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper • 2406.11251 • Published • 9
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 59
ColPali: Efficient Document Retrieval with Vision Language Models
Paper • 2407.01449 • Published • 42
Paper • 2410.07073 • Published • 62
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 124
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 68
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 5