Paper Quick Read: ‘Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve’
Paper Quick Read: ‘SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills’
Paper Quick Read: ‘SGLang: Efficient Execution of Structured Language Model Programs’
My biweekly journal of contributions to vLLM.
Paper Quick Read: ‘MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models’
Paper Quick Read: ‘EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test’
Paper Quick Read: ‘DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model’
Paper Quick Read: ‘DeepSeek-V3 Technical Report’
Paper Quick Read: ‘FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision’
Paper Quick Read: ‘FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving’