Summary: Efficient Memory Management for Large Language Model Serving with PagedAttention

Paper

Key contributions

  • Proposes PagedAttention — an attention algorithm that stores the Transformer KV-cache in fixed-size "pages" instead of one contiguous tensor.

  • Introduces vLLM, an open‑source LLM serving system that uses paging, block tables and copy‑on‑write to manage KV‑cache with near‑zero fragmentation.

  • Shows how paging enables prompt/beam sharing, dynamic growth, preemption (swap or recompute) and distributed model‑parallel execution.

  • Demonstrates 2‑4 × higher throughput than SOTA (FasterTransformer, Orca) at the same latency.

Novelty

  • Treats KV-cache like virtual memory pages, whereas earlier systems focused on scheduling or kernel speed.

  • Block-level mapping table separating logical sequence positions from physical GPU blocks; supports on-demand allocation and reuse (see the block-table sketch after this list).

  • Page-granular copy-on-write to share KV across parallel sampling, beam search and shared prefixes; previous systems duplicated entire tensors (see the copy-on-write sketch after this list).

  • All-or-nothing eviction plus optional recomputation tailored to autoregressive workloads, not found in generic paging or model-serving frameworks (see the preemption sketch after this list).

  • Provides end‑to‑end implementation with fused CUDA kernels so paging adds only ~20 % kernel overhead.
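A minimal sketch of the block-table idea, assuming a fixed pool of same-size physical GPU blocks and a per-sequence logical-to-physical mapping. The `BlockAllocator` and `Sequence` names and methods are illustrative, not vLLM's actual API:

```python
# Minimal sketch of a paged KV cache: a fixed pool of physical GPU blocks plus a
# per-sequence block table mapping logical block indices to physical block ids.
# Names are illustrative, not vLLM's actual API.

BLOCK_SIZE = 16  # tokens per block (the paper's default)

class BlockAllocator:
    """Owns the pool of physical KV blocks and their reference counts."""

    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))  # ids of unused physical blocks
        self.ref_count = {}                           # physical block id -> #sequences mapping it

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:   # last mapper gone: return block to the pool
            del self.ref_count[block]
            self.free.append(block)

class Sequence:
    """One request; block_table maps logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # On-demand growth: a new physical block is claimed only when the previous
        # one is full, so at most one block per sequence is partially unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```

Because allocation happens on demand, internal fragmentation is bounded by one partially filled block per sequence rather than a full max-length reservation, and external fragmentation disappears because every block has the same size.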
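Copy-on-write can be sketched on top of the same structures (reusing the `BlockAllocator` and `Sequence` names from the block-table sketch above; again illustrative rather than the real vLLM code): a fork only bumps reference counts, and a page is copied exactly once, at write time, if another sequence still maps it.

```python
def fork(parent: Sequence, allocator: BlockAllocator) -> Sequence:
    # Parallel sampling / beam branching: the child maps the parent's physical
    # blocks, so the prompt's KV is stored once; only reference counts change.
    child = Sequence(allocator)
    child.block_table = list(parent.block_table)
    child.num_tokens = parent.num_tokens
    for block in child.block_table:
        allocator.ref_count[block] += 1
    return child

def make_last_block_writable(seq: Sequence, allocator: BlockAllocator) -> int:
    # Copy-on-write at page granularity: before new KV entries are written into
    # the last block, give the sequence a private copy if the block is shared.
    block = seq.block_table[-1]
    if allocator.ref_count[block] > 1:
        new_block = allocator.allocate()
        # ... a real system copies the page's KV data on the GPU exactly once here ...
        allocator.ref_count[block] -= 1
        seq.block_table[-1] = new_block
        return new_block
    return block
```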
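The all-or-nothing preemption policy fits the same sketch. The two helpers below (hypothetical names, with the swapped-out state heavily simplified) show that a victim sequence either swaps out or frees all of its blocks together, never a subset, because its KV pages are only ever read as a complete set:

```python
def preempt_by_swap(seq: Sequence, allocator: BlockAllocator, cpu_store: dict) -> None:
    # Swap: copy every KV page of the victim to host memory, then release all of
    # its GPU blocks at once. Partial eviction would free nothing usable, since
    # the next decoding step for this sequence needs every page anyway.
    cpu_store[id(seq)] = seq.num_tokens   # stand-in for the pages copied to CPU RAM
    for block in seq.block_table:
        allocator.release(block)
    seq.block_table = []

def preempt_by_recompute(seq: Sequence, allocator: BlockAllocator) -> None:
    # Recompute: drop the KV entirely; when memory frees up, rerun the model once
    # over "prompt + tokens generated so far" to rebuild it in a single prefill.
    for block in seq.block_table:
        allocator.release(block)
    seq.block_table = []
```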

Evaluation

  • Throughput/latency traces on ShareGPT & Alpaca workloads with OPT-13B/66B/175B and LLaMA-13B, comparing vLLM vs. Orca variants & FasterTransformer.

  • Batch-size analysis (Fig. 13) showing vLLM batches 2-4× more requests under the same memory budget.

  • Parallel sampling & beam search (Fig. 14) demonstrating bigger gains as KV-sharing opportunities grow; memory savings quantified at up to 55%.

  • Shared‑prefix translation and chatbot scenarios to highlight prefix caching and long‑prompt handling (Fig. 16–17).

  • Kernel micro‑benchmarks (Fig. 18a) and block‑size sweep (Fig. 18b) to study paging overhead and optimal page size.

  • Swap vs. recompute microbenchmarks (Fig. 19) to justify the preemption design.

  • Ablations showing that the default block size of 16 tokens balances GPU utilization against internal fragmentation (see the memory calculation below).
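For intuition about the block-size trade-off: larger blocks waste more space in each sequence's last, partially filled block, while very small blocks hurt kernel efficiency and enlarge the block table. A back-of-the-envelope calculation, using the published OPT-13B shape (40 layers, hidden size 5120) with FP16 KV entries; treat the exact figures as approximate:

```python
# KV-cache footprint per token and per 16-token block for OPT-13B in FP16.
num_layers  = 40     # OPT-13B transformer layers
hidden_size = 5120   # OPT-13B hidden dimension (num_heads * head_dim)
bytes_fp16  = 2
block_size  = 16     # tokens per block, vLLM's default

kv_per_token = 2 * num_layers * hidden_size * bytes_fp16   # key + value, every layer
kv_per_block = kv_per_token * block_size

print(f"{kv_per_token / 1024:.0f} KiB per token")    # ~800 KiB
print(f"{kv_per_block / 2**20:.1f} MiB per block")   # ~12.5 MiB
# Worst-case internal waste is block_size - 1 unused token slots in a sequence's
# last block (< 12.5 MiB here), versus reserving KV for the full maximum length.
```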

Limitations

  • Block size is manually tuned; no adaptive policy across models or GPUs.

  • Multi‑GPU sharing still duplicates KV across shards; cross‑device paging left for future work.

  • Evaluation limited to decoder‑only LLMs; encoder‑decoder or Mixture‑of‑Experts models may need further adaptation.

  • Preemption tested only on PCIe-based A100 clusters; swap performance on CPU-less accelerators or SSD tiers is unknown.

Future directions

  • Dynamically choose block size per layer or workload to maximize utilization automatically.

  • Extend block table so pages can migrate or be remotely accessed across GPUs, enabling KV deduplication in tensor parallel setups.

  • Apply paging to training for long‑context or continual‑learning scenarios where activations/KV resemble dynamic, shareable states.

  • QoS‑aware scheduler that mixes different decoding modes (greedy sampling, beam search) and prioritizes latency‑critical requests while using paging for background throughput.

  • Explore hardware support (e.g., on‑device SRAM page tables or GPU MMU hints) to further cut the ~20 % kernel overhead introduced by software paging.

Glossary

  • Copy-on-Write (COW): When a shared page needs modification, allocate a new page, copy data once, then let writers diverge.
  • All‑or‑Nothing Eviction: vLLM’s policy that swaps/recomputes all blocks of one sequence together, exploiting that KV pages are always accessed as a set.
  • Preemption: Temporarily removing sequences when GPU memory is full; vLLM supports Swap (move pages to CPU) or Recompute (regenerate KV cache by rerunning the model on "prompt + generated tokens" once memory is available).
  • Parallel Sampling: Generate several stochastic completions for one prompt simultaneously; the prompt's KV pages can be fully shared across the samples.
  • Beam Search: Keeps top-k hypotheses at each step; early-prefix KV is widely shared and divergent parts use COW, yielding up to 55% memory savings.
  • Shared Prefix: Common system prompt or few‑shot examples reused across requests; cached pages mapped by many sequences.
  • OPT: Open Pre-trained Transformer Language Models
  • FasterTransformer (FT): NVIDIA library with highly‑optimized Transformer kernels but contiguous KV tensors.
  • Orca: Prior serving system using iteration‑level scheduling but still contiguous KV; suffers from fragmentation.
  • Iteration-level Scheduling: Insert or remove requests after every decoding step, avoiding long queueing and padding waste (a sketch follows this glossary).
  • ShareGPT / Alpaca Workloads: Real conversation (long) vs. instruction‑tuning (short) traces used to test memory‑ and compute‑bound extremes.
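Iteration-level (continuous) scheduling, as defined in the glossary above, can be summarized in a few lines. The loop below is a simplification of the idea, not Orca's or vLLM's actual scheduler; `model_step`, `can_allocate`, and `is_finished` are placeholder callbacks for the model forward pass, the allocator check, and the stop-condition test.

```python
from collections import deque

def serve(waiting: deque, model_step, can_allocate, is_finished) -> None:
    """Iteration-level scheduling sketch: the running batch is reformed after
    every decoding step, so new requests join as soon as KV blocks are
    available and finished ones leave immediately, with no padding waste."""
    running = []
    while waiting or running:
        # Admit waiting requests while KV-cache blocks can still be allocated.
        while waiting and can_allocate(waiting[0]):
            running.append(waiting.popleft())
        if not running:
            break  # nothing admissible yet; a real scheduler would wait or preempt
        # One decoding iteration over the whole current batch.
        model_step(running)
        # Retire finished sequences right away; their blocks return to the pool.
        running = [seq for seq in running if not is_finished(seq)]
```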