- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
  Paper • 2411.11504 • Published • 20
- Top-nσ: Not All Logits Are You Need
  Paper • 2411.07641 • Published • 20
- Adaptive Decoding via Latent Preference Optimization
  Paper • 2411.09661 • Published • 10
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
  Paper • 2411.13476 • Published • 15
Collections
Collections including paper arxiv:2410.01131
- VILA^2: VILA Augmented VILA
  Paper • 2407.17453 • Published • 40
- Octopus v4: Graph of language models
  Paper • 2404.19296 • Published • 117
- Octo-planner: On-device Language Model for Planner-Action Agents
  Paper • 2406.18082 • Published • 48
- Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
  Paper • 2408.15518 • Published • 43
- STaR: Bootstrapping Reasoning With Reasoning
  Paper • 2203.14465 • Published • 8
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 46
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 14
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 29
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 126
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 51
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 14
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 65
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
  Paper • 2402.17457 • Published
- Curvature-Informed SGD via General Purpose Lie-Group Preconditioners
  Paper • 2402.04553 • Published
- TextGrad: Automatic "Differentiation" via Text
  Paper • 2406.07496 • Published • 28
- Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling
  Paper • 2405.14578 • Published • 1