Summary for paper ‘Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms’
Summary for paper ‘PyTorch: An Imperative Style, High-Performance Deep Learning Library’
Summary for paper ‘AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration’
Summary for paper ‘Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve’
Summary for paper ‘SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills’
Summary for paper ‘SGLang: Efficient Execution of Structured Language Model Programs’
My bi-weekly journal of contributions to vLLM.
Summary for paper ‘MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models’
Summary for paper ‘EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test’
Summary for paper ‘DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model’