- The Impact of Depth and Width on Transformer Language Model Generalization
  Paper • 2310.19956 • Published • 9
- Retentive Network: A Successor to Transformer for Large Language Models
  Paper • 2307.08621 • Published • 170
- RWKV: Reinventing RNNs for the Transformer Era
  Paper • 2305.13048 • Published • 15
- Attention Is All You Need
  Paper • 1706.03762 • Published • 50
Collections
Collections including paper arxiv:2309.08586
- Replacing softmax with ReLU in Vision Transformers
  Paper • 2309.08586 • Published • 17
- Softmax Bias Correction for Quantized Generative Models
  Paper • 2309.01729 • Published • 1
- The Closeness of In-Context Learning and Weight Shifting for Softmax Regression
  Paper • 2304.13276 • Published • 1
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
  Paper • 2306.12929 • Published • 12

- Efficient Memory Management for Large Language Model Serving with PagedAttention
  Paper • 2309.06180 • Published • 25
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
  Paper • 2308.16137 • Published • 39
- Scaling Transformer to 1M tokens and beyond with RMT
  Paper • 2304.11062 • Published • 2
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 17

- Language Modeling Is Compression
  Paper • 2309.10668 • Published • 83
- Baichuan 2: Open Large-scale Language Models
  Paper • 2309.10305 • Published • 19
- Chain-of-Verification Reduces Hallucination in Large Language Models
  Paper • 2309.11495 • Published • 37
- LMDX: Language Model-based Document Information Extraction and Localization
  Paper • 2309.10952 • Published • 65

- Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
  Paper • 2309.08532 • Published • 53
- A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
  Paper • 2309.06497 • Published • 4
- MindAgent: Emergent Gaming Interaction
  Paper • 2309.09971 • Published • 11
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
  Paper • 2309.09400 • Published • 84

- Self-Alignment with Instruction Backtranslation
  Paper • 2308.06259 • Published • 41
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
  Paper • 2308.03793 • Published • 10
- From Sparse to Soft Mixtures of Experts
  Paper • 2308.00951 • Published • 20
- Revisiting DETR Pre-training for Object Detection
  Paper • 2308.01300 • Published • 9