Paper Quick Read: ‘Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve’
Paper Quick Read: ‘SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills’
Paper Quick Read: ‘SGLang: Efficient Execution of Structured Language Model Programs’
My biweekly journal of contributions to vLLM.
Paper Quick Read: ‘MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models’
Paper Quick Read: ‘EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test’
Paper Quick Read: ‘DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model’
Paper Quick Read: ‘DeepSeek-V3 Technical Report’
Paper Quick Read: ‘FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision’
Paper Quick Read: ‘FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving’