arxiv:2405.04434

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Published on May 7, 2024
Authors: …, Liu Bo, …

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

Community

Architecture:
Decoder-only Transformer with Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) layers. MLA shrinks the key-value cache needed during inference by jointly compressing keys and values into a low-rank latent vector.
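To make the low-rank joint compression concrete, here is a minimal NumPy sketch. The matrix names (`W_dkv`, `W_uk`, `W_uv`) and all sizes are toy assumptions chosen for illustration, not the model's real dimensions, and the sketch leaves out the rotary position embedding part of the attention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not the real model dimensions.
d_model, n_heads, d_head, d_latent, seq_len = 64, 4, 16, 8, 10

# One down-projection shared by keys and values, then separate up-projections.
W_dkv = rng.standard_normal((d_latent, d_model))
W_uk = rng.standard_normal((n_heads * d_head, d_latent))
W_uv = rng.standard_normal((n_heads * d_head, d_latent))

h = rng.standard_normal((seq_len, d_model))  # hidden states, one row per token

c_kv = h @ W_dkv.T  # (seq_len, d_latent): the only tensor kept in the cache
k = c_kv @ W_uk.T   # keys rebuilt from the latent, (seq_len, n_heads * d_head)
v = c_kv @ W_uv.T   # values rebuilt from the latent

# Cached numbers per layer: full keys and values versus the shared latent.
mha_cache = 2 * seq_len * n_heads * d_head
mla_cache = seq_len * d_latent
print(mha_cache, mla_cache)
```

With these toy sizes the latent cache is 16 times smaller than a full key/value cache; the actual saving depends on the real dimensions and on what the baseline model caches.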

Training:
📚 Pretrained on 8.1T tokens, mostly English and Chinese, with a sequence length of 4096.
🎓 Supervised fine-tuning (SFT) on 1.5M samples: 1.2M for helpfulness and 0.3M for safety.
🏆 Used Group Relative Policy Optimization (GRPO) to align the model outputs with human preferences, with a particular focus on instruction following.

Learning Strategy (warmup-and-step-decay strategy):
⬆️ The learning rate increases linearly from 0 to the maximum value (2.4e-4) during the first 2K steps (warmup period).
⬇️ After training on about 60% of the tokens, the learning rate is multiplied by 0.316.
⬇️ After training on about 90% of the tokens, the learning rate is multiplied by 0.316 again.
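The schedule above can be sketched as a small function. This is a simplification: it assumes a constant batch size, so that the fraction of steps equals the fraction of tokens, and `total_steps` is a hypothetical value used only for illustration.

```python
def learning_rate(step, total_steps, max_lr=2.4e-4, warmup_steps=2000, decay=0.316):
    """Warmup-and-step-decay schedule (sketch, assuming a constant batch size)."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    progress = step / total_steps  # stands in for the fraction of tokens seen
    lr = max_lr
    if progress >= 0.6:
        lr *= decay  # first step decay at about 60% of training
    if progress >= 0.9:
        lr *= decay  # second step decay at about 90% of training
    return lr


total_steps = 100_000  # hypothetical
for step in (0, 1_000, 50_000, 70_000, 95_000):
    print(step, learning_rate(step, total_steps))
```

Since 0.316 is roughly 1/sqrt(10), the two decay steps together bring the learning rate down to about a tenth of its maximum.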

Other insights:
📈 Used batch size scheduling, increasing the batch size from 2304 to 9216 over the first 225B tokens.
🌐 Used YaRN to extend the context window from 4K to 128K.
💰 42.5% lower training cost compared to DeepSeek 67B, due to sparse activation.
🏆 MMLU: 78.5; AlpacaEval 2.0: 38.9; MT-Bench: 8.97.
🔧 Used Pipeline Parallelism, Expert Parallelism, and Data Parallelism for distributed training.
🎯 GRPO used a multi-reward framework combining helpfulness, safety, and rule-based rewards.
🔑 MLA significantly reduces the KV cache by compressing keys and values into a latent vector.
🚀 Used a hybrid engine with a vLLM inference backend for RLHF training.

Thanks for the insights! Still unclear to me how one can use the latent and compressed KV vector to get the Key/Values without having to compute them each time we are performing the attention (by doing the up-projection from the compressed vector)

This relates to this passage: "In addition, during inference, since W^{UK} can be absorbed into W^Q, and W^{UV} can be absorbed into W^O, we even do not need to compute keys and values out for attention", which I kinda struggle to understand :/
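A rough sketch of why the absorption works, assuming a single head and ignoring the rotary position embedding part: the attention score is q^T k = (W^Q h)^T (W^{UK} c) = ((W^{UK})^T W^Q h)^T c, so the key up-projection can be folded into the query projection once, and the score is computed directly against the cached latent c. The same regrouping folds W^{UV} into W^O on the output side. All names and sizes below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, d_latent, seq_len = 32, 16, 8, 6  # hypothetical toy sizes

W_q = rng.standard_normal((d_head, d_model))
W_uk = rng.standard_normal((d_head, d_latent))
W_uv = rng.standard_normal((d_head, d_latent))
W_o = rng.standard_normal((d_model, d_head))

h_t = rng.standard_normal(d_model)               # current token's hidden state
c_kv = rng.standard_normal((seq_len, d_latent))  # cached latents of past tokens


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Naive path: up-project every cached latent into a key and a value first.
k = c_kv @ W_uk.T
v = c_kv @ W_uv.T
q = W_q @ h_t
attn = softmax(k @ q / np.sqrt(d_head))
out_naive = W_o @ (attn @ v)

# Absorbed path: fold W_uk into the query side and W_uv into the output side.
W_q_absorbed = W_uk.T @ W_q  # (d_latent, d_model), computed once
W_o_absorbed = W_o @ W_uv    # (d_model, d_latent), computed once
q_latent = W_q_absorbed @ h_t
attn_absorbed = softmax(c_kv @ q_latent / np.sqrt(d_head))
out_absorbed = W_o_absorbed @ (attn_absorbed @ c_kv)

print(np.allclose(out_naive, out_absorbed))
```

In the absorbed path the per-token keys and values are never materialized; attention runs on the latents themselves, and the two extra matrix products are done once on the weights rather than once per cached token.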
