kevin1020's Collections
Inference Acceleration
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Paper • 2401.12522 • Published • 11
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 19
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Paper • 2402.04291 • Published • 48
Shortened LLaMA: A Simple Depth Pruning for Large Language Models
Paper • 2402.02834 • Published • 14
Batch Prompting: Efficient Inference with Large Language Model APIs
Paper • 2301.08721 • Published • 1
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Paper • 2403.09919 • Published • 20
LLM Agent Operating System
Paper • 2403.16971 • Published • 65
The Unreasonable Ineffectiveness of the Deeper Layers
Paper • 2403.17887 • Published • 78
Better & Faster Large Language Models via Multi-token Prediction
Paper • 2404.19737 • Published • 73
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Paper • 2405.00263 • Published • 14
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models
Paper • 2405.18377 • Published • 18
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Paper • 2410.00531 • Published • 30