2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 16 days ago • 95
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper • 2412.19723 • Published 21 days ago • 79
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 105
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper • 2412.03555 • Published Dec 4, 2024 • 124
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 78
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Paper • 2501.04003 • Published 10 days ago • 23
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published 17 days ago • 41
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published 7 days ago • 55
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests Paper • 2501.04671 • Published 9 days ago