Summary: Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve


  • Proposes Sarathi-Serve, an online LLM inference scheduler that simultaneously improves throughput and tail latency.

  • Introduces chunked-prefills and stall-free batching.

  • Yields uniform-compute hybrid batches (decode + small prefill chunks) that avoid generation stalls and reduce pipeline bubbles in PP deployments.

  • Stall-free batching admits ongoing decodes first, then partially completed prefills, then new prefills under the token budget, so decodes are never paused (see the sketch below).
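A minimal Python sketch of this admission order, under simplifying assumptions: one token of budget per decode, a fixed token budget, and an illustrative `Request` object with `form_batch` as a hypothetical helper (not Sarathi-Serve's actual data structures or API).

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Illustrative request state; not Sarathi-Serve's actual data structure."""
    req_id: int
    prompt_len: int
    prefill_done: int = 0      # prompt tokens already processed
    decoding: bool = False     # True once the prefill has fully completed

    @property
    def prefill_remaining(self) -> int:
        return self.prompt_len - self.prefill_done


def form_batch(running, waiting, token_budget):
    """Build one hybrid batch per model iteration (iteration-level batching).

    Admission order from the bullet above: (1) all ongoing decodes,
    (2) chunks of partially completed prefills, (3) chunks of new prefills,
    each chunk sized to fit the leftover token budget.
    """
    batch, budget = [], token_budget

    # 1) Decodes first: each consumes one token of budget. The budget is
    #    assumed to be at least the max decode batch size, so decodes are
    #    never dropped (i.e., never stalled).
    for req in running:
        if req.decoding:
            batch.append((req, "decode", 1))
            budget -= 1

    # 2) Partially completed prefills, chunked to the remaining budget.
    for req in running:
        if not req.decoding and req.prefill_remaining > 0 and budget > 0:
            chunk = min(req.prefill_remaining, budget)
            batch.append((req, "prefill", chunk))
            budget -= chunk

    # 3) New prefills, also chunked, only while budget is left.
    while waiting and budget > 0:
        req = waiting.pop(0)
        chunk = min(req.prefill_remaining, budget)
        batch.append((req, "prefill", chunk))
        budget -= chunk
        running.append(req)    # now tracked as a partially completed prefill

    return batch
```

After the iteration executes, the caller would advance `prefill_done` by the scheduled chunk and flip `decoding` once the whole prompt is prefilled; `form_batch` is then called again for the next iteration.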

  • SLO-aware scheduler built on vLLM with FlashAttention v2/FlashInfer kernels; supports TP, PP, and hybrid parallelism with extensive telemetry.

  • Evaluates maximum sustainable QPS at strict and relaxed P99 TBT targets across Mistral-7B (1×A100), Yi-34B (TP-2, 2×A100), LLaMA2-70B (TP-4 + PP-2, 8×A40), and Falcon-180B (TP-4 + PP-2, 8×A100 across 2 nodes).

  • Compares vLLM (max batch size 32/64/128) against Sarathi-Serve (token budgets 512/2048); shows vLLM's capacity is capped by generation stalls under tight SLOs.

  • Compares TP-8 against TP-4 + PP-2 with and without Sarathi-Serve; shows >2× lower median TBT than cross-node TP along with large capacity gains.

  • Hybrid batching alone reduces TTFT but hurts TBT; chunked prefills alone improve TBT but hurt TTFT; combining the two lowers both.

  • Choosing the token budget requires per-deployment profiling and careful tile-size alignment; dynamic control of the budget is not explored (a hypothetical selection sketch follows).
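To make the profiling and tile-alignment concern concrete, one possible selection routine: keep the largest tile-aligned budget whose profiled iteration latency still meets the TBT SLO. This is a hypothetical helper, not the paper's procedure; `profile_iter_latency_ms` stands in for an offline profiling run on the target model and hardware.

```python
def choose_token_budget(candidates, tbt_slo_ms, profile_iter_latency_ms,
                        tile_size=128):
    """Pick the largest tile-aligned token budget that meets the TBT SLO.

    `profile_iter_latency_ms(budget)` is a placeholder for an offline
    profiling run of one iteration at that budget.
    """
    best = None
    for budget in sorted(candidates):
        aligned = (budget // tile_size) * tile_size   # avoid partial GPU tiles
        if aligned and profile_iter_latency_ms(aligned) <= tbt_slo_ms:
            best = aligned
    return best


# Example: choose_token_budget([256, 512, 1024, 2048], tbt_slo_ms=100,
#                              profile_iter_latency_ms=my_profiler)
```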

  • Small chunk sizes increase TTFT and add repeated KV-cache reads from HBM.

  • The scheduler targets SLO-driven batching; fairness, preemption, and per-tenant QoS are not its central focus.

  • Results emphasize A100/A40 + 100 GbE/NVLink; behavior on other interconnects, very long contexts, MoE, or speculative decoding is not extensively studied.

  • Online RL/feedback control to tune the token budget and chunk sizes per iteration based on live TBT, batch mix, and PP-bubble telemetry (a toy control-loop sketch follows).
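A toy sketch of what such a feedback loop could look like, purely illustrative of this future direction and not part of Sarathi-Serve: grow the budget additively while the observed P99 TBT is under target, shrink it multiplicatively on violations, and keep it tile-aligned.

```python
def adjust_token_budget(budget, observed_p99_tbt_ms, target_tbt_ms,
                        tile_size=128, min_budget=128, max_budget=4096):
    """AIMD-style adjustment of the token budget from live TBT telemetry.

    Purely illustrative; the paper does not implement dynamic budget control.
    """
    if observed_p99_tbt_ms > target_tbt_ms:
        budget = int(budget * 0.8)   # back off when the SLO is violated
    else:
        budget += tile_size          # probe for more throughput headroom
    budget = (budget // tile_size) * tile_size
    return max(min_budget, min(max_budget, budget))
```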

  • Combine stall-free batching within replicas and prefill/decode disaggregation across replicas (with lightweight KV transfer/compression) to push TTFT and capacity further.

  • Integrate with speculative decoding, KV compression/quantization, and prefill caching to reduce chunk overhead and HBM traffic.

  • TBT (Time-Between-Tokens): The elapsed time between two consecutive output tokens during decoding, often tracked at P99 to capture tail latency.
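For concreteness, TBT samples and their P99 can be computed from one request's token emission timestamps; a minimal example (metric arithmetic only, function names are illustrative):

```python
def tbt_samples_ms(token_timestamps_s):
    """Per-token TBT samples (ms): gaps between consecutive output tokens."""
    return [(b - a) * 1000.0
            for a, b in zip(token_timestamps_s, token_timestamps_s[1:])]


def p99(samples):
    """99th-percentile of a list of latency samples (nearest-rank)."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
```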

  • Token budget (τ): A per-iteration cap on the total tokens processed (decodes + prefill chunks) chosen to satisfy TBT SLOs.

  • Iteration-level batching: Allowing requests to join or leave a batch at each model iteration.

  • Request-level batching: Running a fixed set of requests to completion before admitting new ones.

  • Per-tenant QoS: Policies ensuring each tenant gets specified performance (latency/throughput) or resource shares.

  • Telemetry: Metrics and traces collected from the system to monitor performance and guide tuning.