LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024
MoH: Multi-Head Attention as Mixture-of-Head Attention Paper • 2410.11842 • Published Oct 15, 2024
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model Paper • 2303.09867 • Published Mar 17, 2023
Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation Paper • 2303.13399 • Published Mar 23, 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning Paper • 2303.14369 • Published Mar 25, 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment Paper • 2305.12218 • Published May 20, 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Paper • 2311.08046 • Published Nov 14, 2023