Summary for paper ‘SGLang: Efficient Execution of Structured Language Model Programs’
My bi-weekly journal of contributions to vLLM.
Summary for paper ‘MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models’
Summary for paper ‘EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test’
Summary for paper ‘DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model’
Summary for paper ‘DeepSeek-V3 Technical Report’
Summary for paper ‘FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision’
Summary for paper ‘FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving’
Summary for paper ‘Training Compute-Optimal Large Language Models’
Summary for paper ‘Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer’