Collections
Collections including paper arxiv:2402.11131
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 52
- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 18
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Linear Transformers are Versatile In-Context Learners
  Paper • 2402.14180 • Published • 6

- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Paper • 2401.10774 • Published • 54
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
  Paper • 2401.06761 • Published • 1
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  Paper • 2401.02669 • Published • 14
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 53

- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Paper • 2401.10774 • Published • 54
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
  Paper • 2401.15024 • Published • 69
- Speculative Streaming: Fast LLM Inference without Auxiliary Models
  Paper • 2402.11131 • Published • 42
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
  Paper • 2405.11582 • Published • 14

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 145
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 29
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 21
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 66

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 35
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 3
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 28
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 117

- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 28
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
  Paper • 2311.08692 • Published • 12
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 117
- Memory Augmented Language Models through Mixture of Word Experts
  Paper • 2311.10768 • Published • 16

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 35
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  Paper • 2401.02669 • Published • 14
- Speculative Streaming: Fast LLM Inference without Auxiliary Models
  Paper • 2402.11131 • Published • 42