Bi-weekly Journal: Contributions to vLLM
Summary
My bi-weekly journal of contributions to vLLM.
Current data points: 140+ commits merged into the main branch, 400+ PR reviews.
Oct 1 – Oct 14
Customer-Related Bug Fixes:
- From Clayton (llm-d):
- DeepGEMM kernel image issue ("no kernel image is available for execution on the device"): provided technical support and a fix within two days
- Log optimization #26322
- From Lu Fang (Meta):
- WIP: Improve vLLM CUDA Memory Utilization and Estimation #26300
Batch Invariance:
- Close collaboration with Bram Wasti; milestone doc: vLLM Batch-Invariance Work List (a toy sketch of the invariance property follows this list)
- Landed FlashInfer support: #26373
- WIP: DeepSeek-V3 batch invariance on 8xH100: https://github.com/vllm-project/vllm/pull/26609
- Several other small PRs
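For context, batch invariance means a request yields bitwise-identical outputs no matter which other requests share its batch. A minimal torch sketch of the property under test (illustrative shapes; not vLLM's actual test code):

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 512)    # a "batch" of 8 token vectors
w = torch.randn(512, 512)  # a shared weight matrix

# The same row computed alone vs. inside the batch. On CPU these usually
# match bitwise; on GPUs, different batch sizes can select different
# kernels or reduction orders, which is exactly what the batch-invariance
# work has to rule out.
alone = x[0:1] @ w
batched = (x @ w)[0:1]
print("bitwise identical:", torch.equal(alone, batched))
```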
Community Leadership:
- Led implementation
- Vectorize RMS norm variance using vectorize_read_with_alignment: #26234 (reference math in the sketch after this list)
- Fix all mypy checks, tracked in issue #26533
- Fix mypy for vllm/attention and vllm/compilation #26482
- Fix mypy for vllm/distributed #26593
- Fix mypy for vllm/engine and vllm/utils #26540
- Fix mypy for vllm/executor #26845
- Reduce Unit Test to Speed Up CI #22041
- Prune Tests in kernel/mamba #26538
- Pruning kernel Core Tests #26727
- PRs where I was mentioned and did deep reviews:
- #26669: support flashinfer_fp4 moe for 5090 gpu
- #25619: Speedup DeepGEMM warmup with heuristics
- #26438: TypeError: argument 'id': StreamInput must be either an integer or a list of integers
- plus many more
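For reference, the computation behind #26234: RMS norm scales each hidden vector by the reciprocal root of its mean square, and the PR vectorizes the reads feeding that reduction. A plain torch sketch of the math (not the CUDA kernel itself):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # This mean-of-squares reduction is the part the CUDA kernel speeds up:
    # vectorize_read_with_alignment lets each thread read an aligned chunk
    # of the hidden dimension at once instead of one element at a time.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

x = torch.randn(4, 4096)
print(rms_norm(x, torch.ones(4096)).shape)  # torch.Size([4, 4096])
```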
vLLM Contributions:
- Huge performance improvement
- Enabled E8M0 by default on Hopper for DeepGEMM, 5% E2E throughput improvement: #26197 (see the scale-format sketch below)
- Refactoring PRs merged
- Bug fix PRs merged
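For context on #26197: E8M0 is an exponent-only scale format, so enabling it amounts to snapping each block scale to a power of two before the DeepGEMM call. A hedged sketch of that rounding (to_e8m0 is a hypothetical helper, not DeepGEMM's API):

```python
import torch

def to_e8m0(scale: torch.Tensor) -> torch.Tensor:
    # E8M0 stores only an 8-bit exponent, so every representable scale is
    # a power of two. Rounding up is one common choice: it guarantees the
    # scaled values never overflow the FP8 range.
    return torch.exp2(torch.ceil(torch.log2(scale)))

print(to_e8m0(torch.tensor([0.7, 1.0, 3.2])))  # tensor([1., 1., 4.])
```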
Sep 17 – Sep 30
DeepSeek V3.2 Support
- One week on a tight timeline, working through weekends, in close collaboration with Chen Zhang, Yongye Zhu, Kaichao You, and others.
- Main PR: #25896
- Release note: https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html
- Wentao Ye in the Acknowledgements!
- My work (all PRs combined):
- Everything DeepGEMM-related
- Wheels, test scripts, B200 validation
- Weight-loading issues and similar fixes, e.g. #25909
Customer-Related Bug Fixes:
- From Clayton (llm-d):
- vLLM 0.11.0 release blocker
vLLM Contributions:
- Several Refactoring/Fix PRs merged: #25958 #25710 #25519 #25518 #25517 + several more
- Leadership:
- Mentioned by the community and performed deep reviews
Sep 3 – Sep 16
Performance Optimization:
- Optimized DeepGEMM scale contiguous layout
- https://github.com/vllm-project/vllm/pull/24783
- 5.5% throughput improvement
- Ready for review: Triton kernel for per_block_cast_to_fp8 (see the reference sketch after this list)
- https://github.com/vllm-project/vllm/pull/24611
- 6x faster than the torch version
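As referenced above, per_block_cast_to_fp8 quantizes a matrix in 128×128 tiles, each with its own FP32 scale. A torch sketch of the slow reference path that the Triton kernel replaces (padding omitted; block size assumed from the DeepGEMM convention):

```python
import torch

def per_block_cast_to_fp8(x: torch.Tensor, block: int = 128):
    # Assumes both dims divide evenly by `block`; the real kernel pads.
    # Requires a PyTorch build with float8 support (>= 2.1).
    m, n = x.shape
    blocks = x.view(m // block, block, n // block, block)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.view(m, n), scale.squeeze(-1).squeeze(1)

q, s = per_block_cast_to_fp8(torch.randn(256, 256))
print(q.shape, s.shape)  # torch.Size([256, 256]) torch.Size([2, 2])
```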
Severe Bug Fixes:
- Clayton’s torch compile cache issue: https://github.com/vllm-project/vllm/issues/24915
- Torch Inductor graph issue
DBO Support:
- DBO PR landed: https://github.com/vllm-project/vllm/pull/23693 (worked together with Sage and Lucas)
- HT support for DBO ready for review, combined with Lucas’ prefill support: https://github.com/vllm-project/vllm/pull/24845 (a toy sketch of the overlap idea follows)
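For context, DBO (dual-batch overlap) splits a batch into two micro-batches so one micro-batch's communication overlaps the other's compute. A toy two-stream sketch of the idea (the clone stands in for the all-to-all; this is not vLLM's actual scheduler):

```python
import torch

def dbo_toy(a0, a1, w):
    # Micro-batch 0's GEMM runs on one stream while micro-batch 1's
    # "communication" (a clone standing in for DeepEP's all-to-all)
    # runs on another, so the two phases can overlap in time.
    compute, comm = torch.cuda.Stream(), torch.cuda.Stream()
    cur = torch.cuda.current_stream()
    compute.wait_stream(cur)  # inputs were produced on the current stream
    comm.wait_stream(cur)
    with torch.cuda.stream(compute):
        y0 = a0 @ w
    with torch.cuda.stream(comm):
        a1_arrived = a1.clone()
    torch.cuda.synchronize()
    return y0, a1_arrived @ w

if torch.cuda.is_available():
    a = torch.randn(2, 256, 256, device="cuda")
    w = torch.randn(256, 256, device="cuda")
    y0, y1 = dbo_toy(a[0], a[1], w)
    print(y0.shape, y1.shape)
```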
vLLM Contributions:
- Several refactoring/fix PRs merged: #24902 #24887 #24774 #24696 #24674 + 4 other PRs
- Several CI fixes: #24259 #24670
- Reviewed 40+ PRs
Aug 20 – Sep 2
Model Support for DeepSeek V3.1:
- Added Hopper DeepGEMM E8M0 for the DeepSeek-V3.1 scale_fmt
- https://github.com/vllm-project/vllm/pull/23666
Performance Optimization:
- Enabled piecewise CUDA graph for DeepEP HT (see the capture sketch after this list)
- https://github.com/vllm-project/vllm/pull/24123
- 33% E2E throughput improvement for decode
- Enabled DeepGEMM linear on B200
- https://github.com/vllm-project/vllm/pull/23351
- 1.5% E2E throughput improvement
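For context on the piecewise CUDA graph work: graph-safe compute segments are captured and replayed, while steps that cannot be captured (the DeepEP HT communication) stay outside. A minimal torch capture sketch of that pattern (illustrative shapes; not vLLM's compilation machinery):

```python
import torch

if torch.cuda.is_available():
    w = torch.randn(256, 256, device="cuda")
    static_in = torch.randn(8, 256, device="cuda")

    # Warm up on a side stream, then capture the graph-safe segment.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_out = static_in @ w
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = static_in @ w

    # At runtime: copy fresh inputs into the static buffer, replay the
    # captured segment, and run non-capturable steps (e.g. the DeepEP
    # all-to-all) in eager mode between replays.
    static_in.copy_(torch.randn(8, 256, device="cuda"))
    g.replay()
    print(static_out.shape)
```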
Severe Bug Fixes:
- R1 accuracy issue: routed_scaling_factor multiplied twice (a sketch of the pitfall follows this list)
- https://github.com/vllm-project/vllm/pull/24119
- Meta deploys from vLLM main and reached out to express gratitude for the fast fix
- Full CUDA graph hang issue
- https://github.com/vllm-project/vllm/pull/23595
- Temporary fix; more exploration to follow
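To illustrate the bug class behind #24119: routed_scaling_factor must multiply the combined expert output exactly once; if it is applied in two code paths, outputs are silently off by the factor. A toy sketch of the pitfall (hypothetical shapes, not vLLM's MoE code):

```python
import torch

scale = 2.5                        # routed_scaling_factor
topk_w = torch.tensor([0.6, 0.4])  # router weights of the selected experts
expert_out = torch.randn(2, 16)    # the two experts' outputs for one token

combined = (topk_w[:, None] * expert_out).sum(dim=0)
correct = combined * scale          # factor applied exactly once
buggy = combined * scale * scale    # applied twice: off by `scale`
print(torch.allclose(buggy, correct))  # False
```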
DBO Support:
- https://github.com/vllm-project/vllm/pull/23693 (worked together with Sage and Lucas)
- HT single-handle issue fixed
vLLM Contributions:
Aug 6 – Aug 19
I was nominated to be a vLLM committer! Thanks so much to Kaichao You, Michael Goin, Robert Shaw, Taneem Ibrahim, Yuan Tang, and the vLLM community!
https://github.com/vllm-project/vllm/pull/22741
B200 Performance Optimization:
- CUTLASS MLA full CUDA graph support
- https://github.com/vllm-project/vllm/pull/22763
- Also needed for DBO
- 6% E2E Throughput Improvement
- Bug fix for FusedMoEModularKernel #22757
DBO Support:
- Several bugs fixed:
- Fixed set-forward-context error
- Fixed assertion error: num_tokens_across_dp is None
- Fixed ubatch datatype issue
- Fixed R1 accuracy issue
- Built on a B200 system; benchmarking is now easy
vLLM Contributions:
July 24 – Aug 5
B200 Performance Optimization:
- Per-token-group quant CUDA kernel (see the reference sketch after this list)
- #21476 — 15× faster than the original Triton kernel (int8)
- #21867 — using __nv_fp8_e4m3, 10% faster for FP8
- Works on all NVIDIA architectures, not only B200
- NVFP4 optimization
- Bug fix for Compressed Tensor NVFP4: #21465
- Add FlashInfer MoE support for Compressed Tensor NVFP4: #21639 — ~15% E2E throughput
- Other perf wins
- Non-contiguous support for FP8 quantization: #21961 — ~1% E2E throughput
- Optimize reshape_and_cache_flash CUDA kernel: #22036 — 20–40% faster
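The kernels above implement per-token-group quantization: each token's hidden vector is split into fixed-size groups, each quantized with its own scale. A torch reference sketch of the FP8 variant the CUDA kernel accelerates (group size of 128 assumed; shapes illustrative):

```python
import torch

def per_token_group_quant_fp8(x: torch.Tensor, group: int = 128):
    # x: (num_tokens, hidden); hidden assumed divisible by `group`.
    # Each token row is split into groups, each with its own scale.
    t, h = x.shape
    g = x.view(t, h // group, group)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / fp8_max
    q = (g / scale).to(torch.float8_e4m3fn).view(t, h)
    return q, scale.squeeze(-1)

q, s = per_token_group_quant_fp8(torch.randn(4, 512))
print(q.shape, s.shape)  # torch.Size([4, 512]) torch.Size([4, 4])
```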
B200 New DeepGEMM Integration:
✅ Done for this large scope! Special thanks to Kaichao You and Chenggang Zhao for their help.
DBO Support:
- WIP: Collaborated with Sage and Lucas — exciting new scope.
Other Contributions:
July 9 – July 23
B200 Performance Optimization:
- Per-token-group quant CUDA kernel for FP8: #21083 — ~6% E2E improvement; works on all NVIDIA architectures
- WIP at the time: per-token-group quant CUDA kernel for int8 (later landed as #21476)
- NVFP4 optimization: bug fix for Compressed Tensor NVFP4 (ready for review then): #21465
B200 New DeepGEMM Integration:
- Merged support for the breaking DeepGEMM update on B200: #20087
- Upstream DeepGEMM PR: deepseek-ai/DeepGEMM #112
- Follow-up optimizations (all merged):
- DeepEP low-latency bugfix: #20833
- ~15% E2E perf improvement: #20841
- Breaking change fix: #21187
- CUDA init error fix due to DeepGEMM: #21312
CI Bug Fixes:
Other Contributions:
June 23 – July 8
B200 Performance Optimization:
- Quant vectorization utils optimization: #20331 — +3% E2E for CUDA quant kernels; reusable for FP8 quant, reshape_and_cache_flash, etc.
B200 New DeepGEMM Integration:
- WIP then: support for the new breaking DeepGEMM on B200: #20087 — ~40% perf improvement for the GEMM kernel at specific batch sizes
- Special thanks to Michael Goin and Varun Sundar Rabindranath
B200 DeepEP & PPLX Validation:
- Bug fix: #20094 — validation done.
Severe CI Bug — Fixed:
Other Contributions:
June 9 – June 20
B200 Performance Optimization:
- align_moe_block_size kernel optimization: #19572 — ~6% E2E throughput (a padding sketch follows this list)
- Benchmark script refactor for GEMM: #19627 — made future quant benchmarking easier
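For context, aligning the MoE block size means padding each expert's token count up to a multiple of the GEMM block so no tile spans two experts. A sketch of the rounding (align_to_block is a hypothetical helper; the block size is illustrative):

```python
import torch

def align_to_block(tokens_per_expert: torch.Tensor, block: int = 16):
    # Round each expert's token count up to a multiple of `block` so GEMM
    # tiles never straddle two experts; padded slots are masked out later.
    return ((tokens_per_expert + block - 1) // block) * block

print(align_to_block(torch.tensor([5, 17, 0, 33])))  # tensor([16, 32,  0, 48])
```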
B200 DeepGEMM Integration:
- Initial integration: #19820 — ~40% GEMM perf improvement.
Thanks to Robert Shaw!
B200 DeepEP Integration:
- Env setup & initial exploration.
Other Contributions:
- Helped review several PRs.
June 2 – June 7
B200 Performance Optimization:
- Int8 quant kernel optimization: #19233 — ~10% E2E throughput on B200 (see the reference sketch below).
Thanks to Michael Goin’s guidance!
My first vLLM PR!
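For reference, that PR's kernel computes per-token symmetric int8 quantization: one scale per row, taken from the row's absolute maximum. A torch sketch of the math (not the CUDA kernel):

```python
import torch

def per_token_quant_int8(x: torch.Tensor):
    # One symmetric scale per token row: scale = row amax / 127.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

q, s = per_token_quant_int8(torch.randn(4, 512))
print(q.dtype, s.shape)  # torch.int8 torch.Size([4, 1])
```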
Other Contributions:
- Raised issues and reviewed several PRs.