Summary: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
0. Materials
1. What is the paper about?
Proposes EAGLE‑3, an inference‑acceleration method that extends speculative decoding for LLMs.
Replaces feature‑vector prediction with direct token prediction and introduces a training‑time test (TTT) loop to train the draft model on its own noisy outputs.
Fuses low-, mid-, and high-level hidden states from the target model instead of relying solely on top-layer features (sketched below).
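A minimal PyTorch-style sketch of such a draft step, assuming hypothetical names (`DraftModel`, `fuse`, `in_proj`) and eliding details like causal masking; it illustrates the fusion + direct-token-prediction idea, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class DraftModel(nn.Module):
    def __init__(self, hidden: int, vocab: int, nhead: int = 8):
        super().__init__()
        # Fuse low/mid/high-level hidden states taken from the target model.
        self.fuse = nn.Linear(3 * hidden, hidden)
        # Combine token embeddings with the fused target features.
        self.in_proj = nn.Linear(2 * hidden, hidden)
        # The extra decoder layer that the draft model runs per step.
        self.layer = nn.TransformerEncoderLayer(hidden, nhead, batch_first=True)
        # Direct token prediction: plain logits, no feature-regression target.
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, tok_emb, h_low, h_mid, h_high):
        # tok_emb, h_*: (batch, seq, hidden); causal masking omitted for brevity.
        g = self.fuse(torch.cat([h_low, h_mid, h_high], dim=-1))
        x = self.in_proj(torch.cat([tok_emb, g], dim=-1))
        return self.lm_head(self.layer(x))
```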
2. What is new compared to prior work?
Removes feature‑regression loss present in EAGLE/EAGLE‑2, freeing the draft model from high‑dimensional MSE constraints.
During training, the draft model repeatedly feeds its own predictions back as inputs, aligning train‑test distributions and preventing error accumulation.
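A hedged sketch of this train-on-own-outputs loop, assuming greedy draft decoding, a `draft` callable that returns next-token logits, and target-model tokens precomputed as labels; feature handling is elided and all names are hypothetical:

```python
import torch
import torch.nn.functional as F

def ttt_loss(draft, embed, prefix_emb, targets, k=3):
    """`targets` holds the next k target-model tokens, shape (batch, k);
    `draft` maps embeddings (batch, seq, hidden) to next-token logits."""
    loss = 0.0
    emb = prefix_emb                                   # (batch, seq, hidden)
    for t in range(k):
        logits = draft(emb)                            # (batch, vocab)
        loss = loss + F.cross_entropy(logits, targets[:, t])
        # Key TTT step: feed the draft's OWN (possibly wrong) prediction back
        # as the next input, matching what happens at inference time.
        next_tok = logits.argmax(dim=-1)
        emb = torch.cat([emb, embed(next_tok)[:, None, :]], dim=1)
    return loss / k
```

The argmax-and-feed-back step is the point: the draft trains on the same imperfect inputs it will see at test time.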
3. What experiments were run to support the arguments in this paper?
Speed‑up & acceptance length measured against vanilla decoding, spec‑sampling, PLD, Hydra, Medusa, HASS, EAGLE, and EAGLE‑2. EAGLE‑3 tops all with 3‑6.5 × gains.
Throughput +38% at batch size 64 with SGLang on an H100, whereas EAGLE‑2 regresses beyond batch size 24.
Throughput still ≥1.4× at batch size 24 with vLLM on an RTX 3090, while EAGLE‑2's gain turns negative at smaller batch sizes.
Acceptance rates and speed‑ups grow nearly linearly with training‑data scale (1× to 8× ShareGPT), supporting the paper's scaling‑law claim.
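For context, the two headline metrics can be measured roughly as follows; `generate_fn` and its return convention are assumptions for illustration, not the paper's benchmarking harness:

```python
import time

def benchmark(generate_fn, prompts):
    """generate_fn(prompt) -> (tokens, accepted), where `accepted` lists how
    many draft tokens the target verified in each speculation round."""
    t0 = time.perf_counter()
    total_tokens = rounds = accepted = 0
    for p in prompts:
        tokens, acc = generate_fn(p)
        total_tokens += len(tokens)
        accepted += sum(acc)
        rounds += len(acc)
    wall = time.perf_counter() - t0
    return {
        "tokens_per_s": total_tokens / wall,     # divide by vanilla's rate for speed-up
        "acceptance_length": accepted / rounds,  # mean accepted tokens per round
    }
```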
4. What are the shortcomings/limitations of this paper?
No evidence on ~400B-parameter frontier models, owing to GPU budget.
Training still requires access to target‑model hidden states, limiting applicability to closed‑source APIs (yes, "closed" AI!).
Additional FC + decoder layer adds draft‑model compute.
Evaluation focuses on latency/throughput and does not re-validate generation quality.
5. What is a reasonable next step to build upon this paper?
Scale to 100B+ LLMs (e.g., Mixtral, GPT‑J‑MoE).
Combine with quantization & MoE routing to compound speed‑ups while controlling memory footprint.
Extend TTT to multi‑token prediction (Medusa‑style) or contrastive draft selection, potentially raising acceptance length further.
Integrate into distributed serving with tensor and pipeline parallelism (TP/PP).
Appendix
Medusa: attaches multiple decoding heads to the backbone LLM to predict several future tokens in parallel (toy sketch at the end of this appendix)
Hydra: extension of Medusa where the multiple heads are sequentially dependent, further improving acceptance without extra passes
HASS (HArmonized Speculative Sampling): method that distils “harmonised” objectives and contexts into the draft to mitigate train–test mismatch
Contrastive draft selection: choosing between multiple draft candidates by ranking them with a contrastive objective
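A toy sketch of Medusa-style parallel heads for concreteness (illustrative only; the official Medusa heads use residual blocks plus tree-structured verification on top of this):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """n_heads independent heads; head i predicts the token at offset +(i+1)."""
    def __init__(self, hidden: int, vocab: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(),
                          nn.Linear(hidden, vocab))
            for _ in range(n_heads)
        )

    def forward(self, h_last):
        # h_last: (batch, hidden) -- the backbone's last hidden state.
        # All future positions are predicted in parallel; contrast with Hydra,
        # where each head conditions on the previous heads' predictions.
        return torch.stack([head(h_last) for head in self.heads], dim=1)
```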