Summary for paper ‘AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration’
Summary for paper ‘Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve’
Summary for paper ‘SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills’
Summary for paper ‘SGLang: Efficient Execution of Structured Language Model Programs’
My bi-weekly journal of contributions to vLLM.
Summary for paper ‘MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models’
Summary for paper ‘EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test’
Summary for paper ‘DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model’
Summary for paper ‘DeepSeek-V3 Technical Report’
Summary for paper ‘FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision’