Collections
Collections including paper arxiv:2501.05441
- No More Adam: Learning Rate Scaling at Initialization is All You Need
  Paper • 2412.11768 • Published • 41
- SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
  Paper • 2501.06842 • Published • 14
- The GAN is dead; long live the GAN! A Modern GAN Baseline
  Paper • 2501.05441 • Published • 77

- GenEx: Generating an Explorable World
  Paper • 2412.09624 • Published • 88
- IamCreateAI/Ruyi-Mini-7B
  Image-to-Video • Updated • 17.3k • 582
- Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
  Paper • 2412.06016 • Published • 20
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 88

- CompCap: Improving Multimodal Large Language Models with Composite Captions
  Paper • 2412.05243 • Published • 18
- LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
  Paper • 2412.04814 • Published • 45
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
  Paper • 2412.05237 • Published • 47
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
  Paper • 2412.05939 • Published • 15

- A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
  Paper • 2410.02362 • Published • 18
- CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
  Paper • 2401.12208 • Published • 22
- Reliable Tuberculosis Detection using Chest X-ray with Deep Learning, Segmentation and Visualization
  Paper • 2007.14895 • Published • 1
- Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
  Paper • 2412.18525 • Published • 70

- Depth Anything V2
  Paper • 2406.09414 • Published • 96
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
  Paper • 2406.09415 • Published • 51
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
  Paper • 2406.04338 • Published • 35
- SAM 2: Segment Anything in Images and Videos
  Paper • 2408.00714 • Published • 111

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 41
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22