inference optimization
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper • 2205.14135 • Published • 11
Note: https://spaces.ac.cn/archives/10091/comment-page-1 covers the progression of attention variants for shrinking the KV cache: MHA -> MQA (Multi-Query Attention) -> GQA (Grouped-Query Attention) -> MLA (Multi-Head Latent Attention). See the sketch of the KV-cache arithmetic right after this entry.
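The chain above is essentially a sequence of KV-cache reductions. A minimal back-of-the-envelope sketch of that memory arithmetic is below; the layer count, head dimension, sequence length, and fp16 dtype are hypothetical example values, not figures from any of these papers.

```python
# Rough KV-cache size calculator (illustrative only).
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

layers, seq, h, dim = 32, 4096, 32, 128  # hypothetical MHA-like config
print("MHA:", kv_cache_bytes(layers, seq, h, dim) / 2**20, "MiB")  # h KV heads
print("GQA:", kv_cache_bytes(layers, seq, 8, dim) / 2**20, "MiB")  # g = 8 shared KV heads
print("MQA:", kv_cache_bytes(layers, seq, 1, dim) / 2**20, "MiB")  # 1 shared KV head
# MLA instead caches a small per-token latent vector; see the DeepSeek-V2 entry below.
```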
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Paper • 2307.08691 • Published • 8
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Paper • 2407.08608 • Published • 1
Note: A guide to LLM inference and performance: https://www.baseten.co/blog/llm-transformer-inference-guide/ • LLM inference speed of light: https://zeux.io/2024/03/15/llm-inference-sol/
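The "speed of light" note above reduces to a memory-bandwidth bound on batch-1 decoding: every generated token must stream all model weights plus the KV cache from memory. A rough sketch of that arithmetic is below; the parameter count, KV-cache size, and bandwidth figure are hypothetical examples, not measurements.

```python
# Lower-bound latency estimate for memory-bandwidth-bound decoding (sketch).
def decode_speed_of_light(param_bytes, kv_cache_bytes, mem_bandwidth_bytes_per_s):
    """Minimum seconds per token when decoding is bound by memory bandwidth."""
    return (param_bytes + kv_cache_bytes) / mem_bandwidth_bytes_per_s

params = 7e9 * 2                        # 7B parameters in fp16
kv = 2 * 32 * 4096 * 32 * 128 * 2       # example MHA KV cache at 4k context
bw = 3.35e12                            # ~3.35 TB/s HBM bandwidth (rough H100-class figure)
t = decode_speed_of_light(params, kv, bw)
print(f"~{t*1e3:.2f} ms/token lower bound, ~{1/t:.0f} tok/s")
```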
Fast Transformer Decoding: One Write-Head is All You Need
Paper • 1911.02150 • Published • 6
Note: MQA (Multi-Query Attention), a very simple first attempt at reducing the KV cache. The idea is straightforward: all attention heads share a single K and V; in formula terms, drop the per-head superscript (s) from every k and v in MHA. Models using MQA include PaLM, StarCoder, and Gemini. MQA cuts the KV cache to 1/h of the original, which is substantial; purely in terms of memory savings it is already the ceiling. Quality-wise, the loss on most tasks appears limited so far, and MQA's proponents believe it can be recovered with further training. In addition, because K and V are shared, the attention parameter count drops by nearly half; to keep the total parameter count unchanged, the FFN/GLU is usually enlarged accordingly, which compensates for part of the quality loss.
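A minimal sketch of the MQA computation described above: all query heads attend over one shared K/V head, so the KV cache shrinks by a factor of h. Shapes and names are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def mqa(q, k, v):
    """q: [batch, h, seq, d]; k, v: [batch, 1, seq, d] (one shared KV head)."""
    d = q.shape[-1]
    # Broadcasting expands the single K/V head across all h query heads.
    scores = q @ k.transpose(-2, -1) / d**0.5   # [batch, h, seq, seq]
    return F.softmax(scores, dim=-1) @ v        # [batch, h, seq, d]

b, h, s, d = 2, 8, 16, 64
out = mqa(torch.randn(b, h, s, d), torch.randn(b, 1, s, d), torch.randn(b, 1, s, d))
print(out.shape)  # torch.Size([2, 8, 16, 64])
```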
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Paper • 2305.13245 • Published • 5
Note: GQA (Grouped-Query Attention). Some worried that MQA compresses the KV cache too aggressively, hurting learning efficiency and final quality, so GQA emerged as an intermediate between MHA and MQA. Its idea is also simple: split the heads into g groups, where g divides h, and each group shares one pair of K and V. When g = h it reduces to MHA, when g = 1 it reduces to MQA, and when 1 < g < h it interpolates between the two, compressing the KV cache to g/h of the original.
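A minimal sketch of the grouping described above: h query heads are split into g groups, each sharing one K/V head, so g = h recovers MHA and g = 1 recovers MQA. Shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, g):
    """q: [batch, h, seq, d]; k, v: [batch, g, seq, d], where g divides h."""
    b, h, s, d = q.shape
    # View the h query heads as g groups of h//g heads each, so one K/V head
    # broadcasts across every query head in its group.
    q = q.view(b, g, h // g, s, d)
    k = k.unsqueeze(2)                            # [b, g, 1, s, d]
    v = v.unsqueeze(2)
    scores = q @ k.transpose(-2, -1) / d**0.5     # [b, g, h//g, s, s]
    out = F.softmax(scores, dim=-1) @ v           # [b, g, h//g, s, d]
    return out.reshape(b, h, s, d)

b, h, s, d, g = 2, 8, 16, 64, 4
out = gqa(torch.randn(b, h, s, d), torch.randn(b, g, s, d), torch.randn(b, g, s, d), g)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```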
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 14
Note: MLA (Multi-head Latent Attention)
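MLA caches a low-rank latent per token instead of full per-head K/V, then reconstructs K and V from it with up-projections. The heavily simplified sketch below illustrates only that core idea; it omits DeepSeek-V2's decoupled RoPE path and matrix-absorption optimizations, and all dimensions and weight names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

b, s, d_model, h, d_head, d_latent = 2, 16, 512, 8, 64, 128

w_dkv = torch.randn(d_model, d_latent) / d_model**0.5      # down-projection; its output is cached
w_uk  = torch.randn(d_latent, h * d_head) / d_latent**0.5  # up-projection to per-head K
w_uv  = torch.randn(d_latent, h * d_head) / d_latent**0.5  # up-projection to per-head V
w_q   = torch.randn(d_model, h * d_head) / d_model**0.5

x = torch.randn(b, s, d_model)
c_kv = x @ w_dkv                                            # [b, s, d_latent]: the only thing cached
k = (c_kv @ w_uk).view(b, s, h, d_head).transpose(1, 2)     # [b, h, s, d_head]
v = (c_kv @ w_uv).view(b, s, h, d_head).transpose(1, 2)
q = (x @ w_q).view(b, s, h, d_head).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / d_head**0.5
out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, s, h * d_head)
print(c_kv.shape, out.shape)  # small cached latent vs. full multi-head output
```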