Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published 21 days ago
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs Paper • 2412.18925 • Published 12 days ago
Autoregressive Video Generation without Vector Quantization Paper • 2412.14169 • Published 19 days ago
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published 19 days ago
Progressive Multimodal Reasoning via Active Retrieval Paper • 2412.14835 • Published 18 days ago
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published 24 days ago
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Paper • 2412.10302 • Published 24 days ago
Large Concept Models: Language Modeling in a Sentence Representation Space Paper • 2412.08821 • Published 26 days ago
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published about 1 month ago
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints Paper • 2412.07760 • Published 27 days ago
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions Paper • 2412.09596 • Published 25 days ago