Summary: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

  • Introduces DeepSeek‑V2, an open‑source 236 B MoE LLM supporting 128 K context with only 21 B active parameters per token.

  • Proposes Multi-head Latent Attention (MLA) to jointly compress keys and values into a small latent vector, slashing KV-cache size without hurting accuracy.

  • Presents DeepSeekMoE, a fine‑grained expert design with shared‑expert isolation, device‑limited routing, balance losses, and token‑dropping to train large sparse models economically.

  • Applies 1.5 M-sample supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) reinforcement learning, yielding DeepSeek-V2-Chat (SFT) and DeepSeek-V2-Chat (RL).

  • MLA replaces MHA/GQA/MQA with low-rank joint KV compression plus a decoupled RoPE key path, cutting the KV cache by roughly 9× while improving MMLU over MHA (see the MLA sketch after this list).

  • DeepSeekMoE uses 2 shared + 160 routed experts per layer, device-limited routing (each token's routed experts span at most 3 devices), and expert-, device-, and communication-level balance losses, significantly lowering communication relative to GShard (see the routing sketch after this list).

  • A 16-stage pipeline, 8-way expert parallelism, and KV recomputation remove the need for tensor parallelism during training, bringing cost to 172.8 k GPU-hours per trillion tokens (42.5 % less than the dense DeepSeek 67B).

  • Extends to 128 K context via YaRN adaptation applied to the decoupled RoPE key path (see the YaRN sketch after this list).

  • GRPO eliminates the critic network, reducing RL memory while boosting reasoning and open-ended performance (see the GRPO sketch after this list).

  • Efficiency comparisons cover KV-cache size, training GPU-hours, and generation throughput (see the KV-cache comparison after this list).

  • Ablations compare MHA vs GQA vs MQA vs MLA, unrestricted vs device-limited routing, and SFT vs RL on math/code.

  • On standard benchmarks (MMLU, GSM8K, HumanEval, C-Eval, and more) it beats or matches Llama-3-70B, Mixtral-8×22B, and Qwen-1.5-72B.

  • MT-Bench and AlpacaEval 2.0 evaluate open-ended chat quality.

  • NIAH precision stays above 90 % up to 128 K tokens, confirming long-context recall (see the NIAH sketch after this list).

  • Hallucination risk persists, e.g., the code-package “slopsquatting” vulnerability common to LLMs.

  • Alignment introduces an “alignment tax”—BBH and some logical tasks drop slightly after RL.

  • Performance still trails large proprietary models (GPT‑4‑Turbo, ERNIE‑4.0) on hardest Chinese reasoning and cross‑lingual tasks.

  • Supports only the text modality, with limited capabilities in languages beyond Chinese & English.

  • Scale up the MoE (more experts/depth) while keeping activated parameters low, reaching higher performance without prohibitive cost.

  • Extend to multimodal inputs (vision‑text, audio).

  • Explore critic‑free or direct‑preference optimisation to reduce alignment tax and further boost reasoning.

  • Broaden multilingual coverage by adding high‑quality data and targeted SFT/RL for non‑EN/ZH languages.

  • Needle‑in‑a‑Haystack (NIAH) Benchmark – tests long‑context recall by hiding a short “needle” passage inside tens of thousands of filler tokens and asking the model about it.

  • AlpacaEval 2.0 – an automatic, GPT‑4‑judged evaluation that reports the win rate of a model’s responses against a strong reference across diverse user instructions.

  • MT‑Bench – a multi‑turn dialogue benchmark where LLMs are scored by other strong LLMs on coherence, helpfulness and depth over 80 chatbot conversations.
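
A minimal sketch of the MLA idea referenced above: keys and values are jointly down-projected into a small latent that is the only thing cached, plus a thin decoupled key path that carries RoPE. All dimensions, module names, and the omission of RoPE itself are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (not the paper's exact shapes).
d_model, n_heads, d_head = 1024, 8, 64
d_latent, d_rope = 128, 32                                  # compressed latent / decoupled RoPE dims

W_dkv = nn.Linear(d_model, d_latent, bias=False)            # joint KV down-projection
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)   # key up-projection
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)   # value up-projection
W_kr  = nn.Linear(d_model, d_rope, bias=False)              # decoupled key that carries RoPE

h = torch.randn(2, 16, d_model)                             # (batch, seq, hidden)

# Only these two small tensors need to live in the KV cache per token.
c_kv   = W_dkv(h)                                           # (2, 16, d_latent)
k_rope = W_kr(h)                                            # (2, 16, d_rope); RoPE omitted for brevity

# At attention time, full keys/values are re-expanded from the cached latent.
k = W_uk(c_kv).view(2, 16, n_heads, d_head)
v = W_uv(c_kv).view(2, 16, n_heads, d_head)

print("MLA cache per token:", d_latent + d_rope)            # 160 elements here
print("MHA would cache:    ", 2 * n_heads * d_head)         # 1024 elements here
```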
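
For the efficiency-comparison bullet, a back-of-the-envelope helper for per-token, per-layer KV-cache size under MHA, GQA, MQA, and MLA. The head counts and dimensions below are illustrative assumptions.

```python
def kv_cache_per_token(variant, n_heads, d_head,
                       n_kv_groups=8, d_latent=None, d_rope=None):
    """Elements cached per token, per layer, for each attention variant."""
    if variant == "mha":
        return 2 * n_heads * d_head         # full keys and values for every head
    if variant == "gqa":
        return 2 * n_kv_groups * d_head     # one K/V pair per head group
    if variant == "mqa":
        return 2 * d_head                   # a single shared K/V head
    if variant == "mla":
        return d_latent + d_rope            # compressed latent + decoupled RoPE key
    raise ValueError(f"unknown variant: {variant}")

# Illustrative shapes: 128 heads of dim 128, latent of 4*d_head, RoPE dim of d_head/2.
for v in ("mha", "gqa", "mqa", "mla"):
    print(v, kv_cache_per_token(v, n_heads=128, d_head=128, d_latent=512, d_rope=64))
```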
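
A simplified sketch of device-limited routing in the spirit of DeepSeekMoE: each token is first restricted to the M highest-affinity devices, and the top-k routed experts are then chosen only from those devices. The expert-to-device layout, gate normalization, and the balance losses are simplified assumptions.

```python
import torch

n_experts, experts_per_device, top_k, max_devices = 160, 20, 6, 3

def device_limited_route(scores):
    """scores: (n_tokens, n_experts) softmax affinities from the router."""
    n_tokens = scores.shape[0]
    # Group expert affinities by the device hosting them: (T, n_devices, experts/device).
    per_device = scores.view(n_tokens, -1, experts_per_device)
    device_best = per_device.max(dim=-1).values                    # best expert per device
    keep = device_best.topk(max_devices, dim=-1).indices           # M devices kept per token
    # Mask out every expert that lives on a device we did not keep.
    mask = torch.zeros_like(per_device, dtype=torch.bool)
    mask.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, experts_per_device), True)
    masked = scores.masked_fill(~mask.view(n_tokens, -1), float("-inf"))
    # Ordinary top-k routing, but only among experts on the kept devices.
    gate_values, expert_ids = masked.topk(top_k, dim=-1)
    return gate_values, expert_ids

scores = torch.softmax(torch.randn(4, n_experts), dim=-1)
gates, ids = device_limited_route(scores)
print(ids)          # each row touches experts from at most `max_devices` devices
```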
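
A simplified sketch of YaRN-style "NTK-by-parts" frequency interpolation as it might be applied to the decoupled RoPE key path for 128 K extension. The scale factor, the alpha/beta thresholds, and the RoPE dimension are assumed hyperparameters, not the values reported in the paper.

```python
import math
import torch

def yarn_inv_freq(d_rope=64, base=10000.0, orig_ctx=4096, scale=32.0,
                  alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequencies per dimension."""
    inv_freq = base ** (-torch.arange(0, d_rope, 2).float() / d_rope)
    # Number of full rotations each dimension completes over the original context.
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # Ramp is 0 for slow (low-frequency) dims -> fully interpolated (divide by scale),
    # 1 for fast (high-frequency) dims -> left untouched, with a linear blend between.
    ramp = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return inv_freq / scale * (1 - ramp) + inv_freq * ramp

inv_freq = yarn_inv_freq()
# YaRN additionally scales attention logits by roughly 0.1 * ln(scale) + 1; omitted here.
print(inv_freq[:3], inv_freq[-3:])
```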
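
A toy sketch of the critic-free GRPO advantage mentioned above: sample a group of responses per prompt, score each with the reward model, and use the group-normalized reward as the advantage, so no value network is needed. Group size and reward values are made up for illustration.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """group_rewards: (n_prompts, group_size) scalar rewards for sampled responses.
    The advantage is the reward normalized within its own group; no critic required."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# One prompt, a group of four sampled responses with illustrative rewards.
rewards = torch.tensor([[0.2, 0.9, 0.4, 0.7]])
print(grpo_advantages(rewards))
```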
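
A toy needle-in-a-haystack harness matching the NIAH glossary entry: bury a short fact at a chosen depth inside filler text and check whether the model's answer contains it. `ask_model` is a hypothetical stand-in for whatever inference API is under test.

```python
def build_niah_prompt(needle, depth_ratio, n_filler_sentences=2000):
    """Hide `needle` at roughly `depth_ratio` (0 = start, 1 = end) of a long haystack."""
    filler = ["The quick brown fox jumps over the lazy dog."] * n_filler_sentences
    filler.insert(int(depth_ratio * len(filler)), needle)
    return " ".join(filler) + "\nWhat is the magic number mentioned above?"

def niah_pass(ask_model, needle_answer="7481", depth_ratio=0.5):
    prompt = build_niah_prompt(f"The magic number is {needle_answer}.", depth_ratio)
    return needle_answer in ask_model(prompt)   # ask_model: hypothetical LLM call

# Example with a fake model that simply echoes the needle back.
print(niah_pass(lambda prompt: "The magic number is 7481."))
```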