On the Compositional Generalization of Multimodal LLMs for Medical Imaging Paper • 2412.20070 • Published 10 days ago • 40
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Paper • 2412.18525 • Published 14 days ago • 64
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper • 2412.19723 • Published 11 days ago • 72
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control Paper • 2501.01427 • Published 5 days ago • 45
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 6 days ago • 85
SigLIP Collection Contrastive (sigmoid) image-text models from https://arxiv.org/abs/2303.15343 • 10 items • Updated 25 days ago • 50
Article ColPali: Efficient Document Retrieval with Vision Language Models By manu • Jul 5, 2024 • 184
Qwen2-VL Collection Vision-language model series based on Qwen2 • 16 items • Updated Dec 6, 2024 • 187
Architectural Approaches to Overcome Challenges in the Development of Data-Intensive Systems Paper • 2312.03049 • Published Dec 5, 2023 • 2