The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Paper • 2402.17764 • Published Feb 27, 2024 • 605
BitNet: Scaling 1-bit Transformers for Large Language Models Paper • 2310.11453 • Published Oct 17, 2023 • 96
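The two BitNet papers above train transformers whose weights live in {-1, +1} (BitNet) or {-1, 0, +1} (b1.58). A minimal NumPy sketch of the absmean ternary quantizer the b1.58 paper describes, leaving out activation quantization and the packed low-bit kernels:

```python
import numpy as np

def absmean_ternary_quantize(W: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with one per-tensor scale:
    divide by the mean absolute value (absmean), then round and clip."""
    gamma = np.abs(W).mean()                         # absmean scale
    W_q = np.clip(np.rint(W / (gamma + eps)), -1, 1)
    return W_q.astype(np.int8), gamma                # dequantize as gamma * W_q
```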
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models Paper • 2404.02258 • Published Apr 2, 2024 • 104
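Mixture-of-Depths lets a router decide, per block, which tokens are worth the compute. A toy sketch of the top-k routing step, where the scalar router, the 12.5% capacity, and the score-gated residual are illustrative choices rather than the paper's exact configuration:

```python
import numpy as np

def mod_block(x, router_w, block_fn, capacity=0.125):
    """Route only the top-k tokens (by router score) through `block_fn`;
    the rest ride the residual stream unchanged. Scaling the block output
    by its router score keeps the routing decision differentiable."""
    scores = x @ router_w                            # (seq,) one score per token
    k = max(1, int(capacity * x.shape[0]))
    keep = np.sort(np.argsort(scores)[-k:])          # top-k tokens, order preserved
    out = x.copy()
    out[keep] += scores[keep, None] * block_fn(x[keep])
    return out
```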
TransformerFAM: Feedback attention is working memory Paper • 2404.09173 • Published Apr 14, 2024 • 43
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length Paper • 2404.08801 • Published Apr 12, 2024 • 64
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention Paper • 2404.07143 • Published Apr 10, 2024 • 104
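Infini-attention bolts a compressive memory onto standard attention: each segment reads from a running linear-attention store and then writes its own key/value associations back in. A sketch of that read/write step under the paper's ELU+1 kernel (the local softmax attention and the learned gate that blends the two paths are omitted):

```python
import numpy as np

def elu1(x):
    return np.where(x > 0, x + 1.0, np.exp(x))       # ELU(x) + 1 feature map

def infini_memory_step(Q, K, V, M, z):
    """Retrieve from the compressive memory, then update it.
    Shapes: Q, K (n, d_k); V (n, d_v); M (d_k, d_v); z (d_k,)."""
    sQ, sK = elu1(Q), elu1(K)
    A_mem = (sQ @ M) / (sQ @ z)[:, None]             # normalized retrieval
    M = M + sK.T @ V                                 # associative write
    z = z + sK.sum(axis=0)                           # normalizer update
    return A_mem, M, z
```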
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence Paper • 2404.05892 • Published Apr 8, 2024 • 32
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Paper • 2404.05726 • Published Apr 8, 2024 • 20
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper • 2402.13753 • Published Feb 21, 2024 • 114
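LongRoPE stretches the context window by rescaling rotary position embeddings non-uniformly across dimensions, with the per-dimension factors found by evolutionary search. A generic RoPE-with-rescaling sketch; the `scale` vector here is a stand-in for the searched factors (`scale=None` gives vanilla RoPE):

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0, scale=None):
    """Per-dimension rotary frequencies, optionally slowed down to
    interpolate positions (larger scale => longer effective context)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return inv_freq if scale is None else inv_freq / scale

def apply_rope(x, positions, inv_freq):
    """Rotate channel pairs of x (seq, head_dim) by position-dependent angles."""
    angles = np.outer(positions, inv_freq)           # (seq, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```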
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study Paper • 2404.14047 • Published Apr 22, 2024 • 44
SnapKV: LLM Knows What You are Looking for Before Generation Paper • 2404.14469 • Published Apr 22, 2024 • 23
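SnapKV compresses the prompt's KV cache before generation starts: attention from the last few prompt queries (the "observation window") votes for which earlier positions matter, and a pooling pass keeps clusters of important positions intact. A sketch with illustrative sizes (the paper also always retains the window's own KV entries):

```python
import numpy as np

def snapkv_select(attn, window=32, keep=256, pool=7):
    """attn: (heads, q_len, kv_len) prompt attention weights.
    Returns sorted indices of the `keep` KV positions to retain."""
    votes = attn[:, -window:, :].sum(axis=(0, 1))    # (kv_len,) importance votes
    pad = pool // 2                                  # 1-D max pooling smooths votes
    padded = np.pad(votes, pad, mode="edge")
    pooled = np.lib.stride_tricks.sliding_window_view(padded, pool).max(axis=-1)
    return np.sort(np.argsort(pooled)[-keep:])       # top-`keep`, original order
```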
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding Paper • 2404.16710 • Published Apr 25, 2024 • 75
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting Paper • 2404.18911 • Published Apr 29, 2024 • 29
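LayerSkip and Kangaroo share one trick: the first few layers of the model draft tokens cheaply, and the full-depth model verifies the whole draft in a single pass, so no separate draft model is needed. A schematic greedy version, where `model.greedy_next(seq, num_layers=...)` and `model.greedy_all(seq)` (greedy next-token ids after every prefix) are hypothetical interfaces:

```python
def self_speculative_decode(model, tokens, exit_layer=8, k=4, max_new=64):
    target = len(tokens) + max_new
    while len(tokens) < target:
        # 1) Draft k tokens using only the first `exit_layer` layers.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = model.greedy_next(ctx, num_layers=exit_layer)
            draft.append(t); ctx.append(t)
        # 2) One full-depth pass scores every drafted position at once.
        full = model.greedy_all(tokens + draft)  # full[i]: token after prefix i+1
        n, accepted = len(tokens), 0
        while accepted < k and draft[accepted] == full[n - 1 + accepted]:
            accepted += 1
        # 3) Keep the agreed prefix plus one guaranteed full-model token.
        tokens += draft[:accepted] + [full[n - 1 + accepted]]
    return tokens
```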
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report Paper • 2405.00732 • Published Apr 29, 2024 • 118
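The adapters behind LoRA Land are standard LoRA: a frozen pretrained weight plus a trainable low-rank update scaled by alpha/r. A shapes-only sketch (zero-initializing B follows the usual convention, so training starts exactly at the base model):

```python
import numpy as np

class LoRALinear:
    """y = W x + (alpha / r) * B A x, with W frozen and A, B trainable."""
    def __init__(self, W, r=8, alpha=16, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        d_out, d_in = W.shape
        self.W = W                                 # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (r, d_in))  # small random init
        self.B = np.zeros((d_out, r))              # zero init: no change at step 0
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```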
Imp: Highly Capable Large Multimodal Models for Mobile Devices Paper • 2405.12107 • Published May 20, 2024 • 25
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Paper • 2405.21060 • Published May 31, 2024 • 63
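The SSD paper's core duality, in its simplest scalar-decay, single-channel form: a gated state-space scan and a masked-attention matrix multiply compute exactly the same map. Both functions below return the same `y` (up to floating point):

```python
import numpy as np

def ssm_scan(a, B, C, x):
    """Recurrent mode: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t.
    Shapes: a, x (T,); B, C (T, N)."""
    T, N = B.shape
    h, y = np.zeros(N), np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssd_attention(a, B, C, x):
    """Attention mode: y = (L * C B^T) x with mask L[i, j] = a[j+1] * ... * a[i]."""
    c = np.cumprod(a)
    L = np.tril(c[:, None] / c[None, :])   # cumulative-decay causal mask
    return (L * (C @ B.T)) @ x
```

Materializing `L` costs quadratic time while the scan is linear; SSD's block decomposition mixes the two modes for hardware-efficient training.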
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling Paper • 2406.07522 • Published Jun 11, 2024 • 37
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B Paper • 2406.07394 • Published Jun 11, 2024 • 25
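The MCT self-refine loop treats candidate answers as tree nodes, critique-then-rewrite refinements as expansions, and a self-assigned scalar reward as the value signal for UCB-style selection. A skeleton where `llm_answer`, `llm_refine`, and `llm_score` are hypothetical callables standing in for prompted LLaMa-3 calls:

```python
import math

def mcts_self_refine(question, llm_answer, llm_refine, llm_score,
                     rollouts=16, c=1.4):
    root = {"ans": llm_answer(question), "n": 0, "q": 0.0, "kids": [], "parent": None}
    nodes = [root]
    for _ in range(rollouts):
        node = root                                   # selection via UCB1
        while node["kids"]:
            node = max(node["kids"], key=lambda k:
                       k["q"] / max(k["n"], 1)
                       + c * math.sqrt(math.log(node["n"] + 1) / max(k["n"], 1)))
        child = {"ans": llm_refine(question, node["ans"]),  # expand: one refinement
                 "n": 0, "q": 0.0, "kids": [], "parent": node}
        node["kids"].append(child); nodes.append(child)
        r = llm_score(question, child["ans"])         # self-reward, e.g. in [-1, 1]
        while child is not None:                      # backpropagate the reward
            child["n"] += 1; child["q"] += r
            child = child["parent"]
    return max(nodes, key=lambda k: k["q"] / max(k["n"], 1))["ans"]
```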
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling Paper • 2407.21787 • Published Jul 31, 2024 • 12
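Repeated-sampling coverage is reported as pass@k, for which the standard unbiased estimator (given n attempts, c of them correct) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n attempts, c of which are correct, solves the task:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0                 # fewer wrong attempts than draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The paper's point is that this coverage keeps climbing across orders of magnitude of samples, which makes inference compute a scaling axis of its own.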
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B Paper • 2409.11055 • Published Sep 17, 2024 • 16
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction Paper • 2409.17422 • Published Sep 25, 2024 • 24
Thinking LLMs: General Instruction Following with Thought Generation Paper • 2410.10630 • Published Oct 14, 2024 • 18
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models Paper • 2409.17066 • Published Sep 25, 2024 • 28
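VPTQ quantizes groups of weights jointly against a learned codebook rather than scalar by scalar. A generic k-means vector-quantization sketch conveying the storage idea (indices plus a small codebook); VPTQ's second-order optimization and residual codebooks are not reproduced here:

```python
import numpy as np

def vq_weights(W, group=8, codebook_size=256, iters=10, seed=0):
    """Split W into length-`group` vectors, learn centroids by k-means,
    and store each vector as a 1-byte index into the codebook."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, group)
    cb = vecs[rng.choice(len(vecs), codebook_size, replace=False)].copy()
    for _ in range(iters):
        assign = ((vecs[:, None, :] - cb[None]) ** 2).sum(-1).argmin(1)
        for j in range(codebook_size):
            members = vecs[assign == j]
            if len(members):
                cb[j] = members.mean(0)
    idx = ((vecs[:, None, :] - cb[None]) ** 2).sum(-1).argmin(1)
    return idx.astype(np.uint8).reshape(W.shape[0], -1), cb
```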
Efficiently Serving LLM Reasoning Programs with Certaindex Paper • 2412.20993 • Published Dec 30, 2024 • 28
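Certaindex allocates inference compute by tracking how certain the model already is about its answer and cutting off reasoning that has converged. A toy flavor of the idea, not the paper's scheduler: stop drawing samples once the majority answer's share clears a threshold. All names here are illustrative:

```python
from collections import Counter

def certainty_guided_sampling(sample_fn, max_samples=32, threshold=0.9, min_samples=4):
    """Draw answers from `sample_fn` (one reasoning rollout per call) until a
    crude certainty proxy, the majority answer's share, exceeds `threshold`."""
    answers = []
    for i in range(max_samples):
        answers.append(sample_fn())
        if i + 1 >= min_samples:
            top, count = Counter(answers).most_common(1)[0]
            if count / len(answers) >= threshold:
                break
    return Counter(answers).most_common(1)[0][0], len(answers)
```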