kaizuberbuehler's Collections
LM Architectures
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper • 2404.08801 • Published • 65
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Paper • 2404.07839 • Published • 44
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper • 2404.05892 • Published • 33
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper • 2312.00752 • Published • 139
Multi-Head Mixture-of-Experts
Paper • 2404.15045 • Published • 60
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 107
KAN: Kolmogorov-Arnold Networks
Paper • 2404.19756 • Published • 109
Better & Faster Large Language Models via Multi-token Prediction
Paper • 2404.19737 • Published • 74
Contextual Position Encoding: Learning to Count What's Important
Paper • 2405.18719 • Published • 5
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper • 2405.21060 • Published • 64
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
Paper • 2406.09416 • Published • 28
Transformers meet Neural Algorithmic Reasoners
Paper • 2406.09308 • Published • 44
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Paper • 2406.07522 • Published • 38
Explore the Limits of Omni-modal Pretraining at Scale
Paper • 2406.09412 • Published • 10
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
Paper • 2406.07394 • Published • 26
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper • 2406.11816 • Published • 23
Mixture of A Million Experts
Paper • 2407.04153 • Published • 5
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Paper • 2407.12854 • Published • 30
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper • 2407.21770 • Published • 22
Transformer Explainer: Interactive Learning of Text-Generative Models
Paper • 2408.04619 • Published • 156
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Paper • 2408.12570 • Published • 31
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 51
LLMs + Persona-Plug = Personalized LLMs
Paper • 2409.11901 • Published • 32
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper • 2409.16280 • Published • 18
Differential Transformer
Paper • 2410.05258 • Published • 169
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 89
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper • 2412.01169 • Published • 12
Monet: Mixture of Monosemantic Experts for Transformers
Paper • 2412.04139 • Published • 12
MH-MoE: Multi-Head Mixture-of-Experts
Paper • 2411.16205 • Published • 24
Hymba: A Hybrid-head Architecture for Small Language Models
Paper • 2411.13676 • Published • 40
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Paper • 2411.10958 • Published • 52
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Paper • 2411.04965 • Published • 64
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 50
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Paper • 2501.04519 • Published • 237
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper • 2501.08313 • Published • 263
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 72
Transformer^2: Self-adaptive LLMs
Paper • 2501.06252 • Published • 48
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Paper • 2501.09755 • Published • 30
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper • 2501.09747 • Published • 21