Bi-weekly Journal: Contributions to vLLM
Summary
My bi-weekly journal of contributions to vLLM.
Current data points: 270+ commits merged into the main branch, 600+ PR reviews.
All contributions: https://github.com/vllm-project/vllm/graphs/contributors
All PR reviews: https://github.com/vllm-project/vllm/pulls?q=is%3Apr+is%3Aopen+reviewed-by%3Ayewentao256+
Jan 21 - Feb 3
Async Scheduling:
- Done: Async Scheduling + Pipeline Parallel Support (V1) https://github.com/vllm-project/vllm/issues/32701
- Landed https://github.com/vllm-project/vllm/pull/32618 Full support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement (a sketch of the overlap idea follows this list)
- Other Optimizations
- Under review: https://github.com/vllm-project/vllm/pull/32975 Optimize detokenizer python logic
- Under review: https://github.com/vllm-project/vllm/pull/33612 Optimize spec decoding + async scheduling, 1.5% Throughput improvement
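For context on the async-scheduling items above, a toy sketch of the overlap idea: while the GPU executes step N, the CPU already schedules step N+1, so scheduler/bookkeeping overhead hides behind GPU time, and pipeline parallelism simply keeps more batches in flight. `schedule` and `execute` are hypothetical stand-ins here, not vLLM's actual APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_steps(schedule, execute, num_steps):
    # Toy illustration only: `schedule` builds the next batch on the CPU,
    # `execute` runs a batch on the GPU; the two overlap across steps.
    with ThreadPoolExecutor(max_workers=1) as pool:
        batch = schedule()                        # schedule step 0 on the CPU
        for _ in range(num_steps):
            future = pool.submit(execute, batch)  # GPU works on the current batch
            batch = schedule()                    # CPU prepares the next batch concurrently
            future.result()                       # wait for the step to finish
```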
Performance optimizations:
- Landed https://github.com/vllm-project/vllm/pull/32892 Optimize moe_permute kernel, 40%~300% kernel performance improvement
- Under review: https://github.com/vllm-project/vllm/pull/33593 Optimize Python slice operations using islice (see the sketch after this list)
- Under review: https://github.com/vllm-project/vllm/pull/33449 Remove align block size logic in moe_permute
- Under review: https://github.com/vllm-project/vllm/pull/33368 Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement
- Landed Optimize dcp allocate tensor https://github.com/vllm-project/vllm/pull/33102
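The islice change above is about avoiding intermediate list copies on hot paths; a minimal illustration (not the exact vLLM call sites):

```python
from itertools import islice

data = list(range(1_000_000))

# Plain slicing materializes a new list with the selected elements ...
head = data[:10]

# ... while islice iterates lazily over the same prefix without the copy.
head_lazy = list(islice(data, 10))
assert head == head_lazy
```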
Batch Invariant:
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
- Leading the community to deliver bug fixes / features
vLLM Contributions:
- Refactoring PRs
- Bug Fix PRs:
- Leading the effort to fix all mypy checks, tracked in issue #26533
Jan 5 - Jan 20
Async Scheduling:
- Async Scheduling + Pipeline Parallel Support
- Landed Optimizations
- https://github.com/vllm-project/vllm/pull/32211 Optimize requests abort
- https://github.com/vllm-project/vllm/pull/32056 Optimize async scheduling placeholder using empty
- https://github.com/vllm-project/vllm/pull/32034 Remove numpy split in async scheduling
Performance optimizations:
- (Done) Tracking issue: Optimizations for MOE cutlass models https://github.com/vllm-project/vllm/issues/31755
- Tasks (All landed)
- Optimize grouped_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement #30159
- Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement #31754
- Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement #31830
- Further optimize the grouped topk kernel, 1.2%~2% E2E Throughput improvement #32058 (a reference sketch of grouped top-k follows this list)
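For reference, the grouped (group-limited) top-k routing that these kernels fuse, written as plain PyTorch for clarity; this is an illustrative sketch assuming max-based group scores, not the exact vLLM implementation:

```python
import torch

def grouped_topk_reference(scores, num_groups, topk_groups, topk):
    # scores: (num_tokens, num_experts) router probabilities.
    n_tokens, n_experts = scores.shape
    # Score each group by its best expert, keep only the top `topk_groups` groups.
    group_scores = scores.view(n_tokens, num_groups, -1).max(dim=-1).values
    group_idx = group_scores.topk(topk_groups, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    expert_mask = group_mask.unsqueeze(-1).expand(
        n_tokens, num_groups, n_experts // num_groups).reshape(n_tokens, n_experts)
    # Pick the final top-k experts only from the surviving groups.
    masked = scores.masked_fill(expert_mask == 0, float("-inf"))
    return masked.topk(topk, dim=-1)

# Example: 16 experts in 4 groups, route each token to 4 experts in the best 2 groups.
scores = torch.randn(4, 16).softmax(dim=-1)
topk_vals, topk_ids = grouped_topk_reference(scores, num_groups=4, topk_groups=2, topk=4)
```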
Batch Invariant:
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
- vLLM office hour speaker (Jan 08) https://docs.google.com/presentation/d/1iaZkoyf2VDQFc3TB2DGld2MpZ-uLsVrXdtYWfT9hEBc/edit?slide=id.g39235cbcce8_2_318#slide=id.g39235cbcce8_2_318
- https://www.youtube.com/watch?v=sDLR9DvEFq4
vLLM Contributions:
- Refactoring PRs
- Bug Fix PRs:
- Leading the effort to fix all mypy checks, tracked in issue #26533
Dec 23 - Jan 4
- Bug Fix PRs merged:
Dec 10 - Dec 22
Batch Invariant:
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
- Landed Fix batch invariant in torch 2.10 https://github.com/vllm-project/vllm/pull/30907
Performance optimizations:
- Landed: Optimize deepgemm experts initialization, 3.9% TTFT improvement https://github.com/vllm-project/vllm/pull/30494
Release Blocker Bug Fix:
- WIP: https://github.com/vllm-project/vllm/pull/30914 Fix torch inductor issue (shape passing through sub-graphs)
- Landed https://github.com/vllm-project/vllm/pull/31046 Fix error ‘Dynamo failed to run FX node with fake tensors’ for DeepSeek V3.2
- WIP: https://github.com/vllm-project/vllm/pull/31160 Fix ‘Number of dimensions of tensors must match’ error for DeepSeek V3.2
vLLM Contributions:
- Refactoring PRs
- https://github.com/vllm-project/vllm/pull/30898 Refactor for DeepGemmQuantScaleFMT using cache
- https://github.com/vllm-project/vllm/pull/30562 Small refactor for group topk
- https://github.com/vllm-project/vllm/pull/30559 Enable eplb with default all2all backend
- https://github.com/vllm-project/vllm/pull/30282 Refactor for parallel_config in FusedMoEModularKernel
- … And several more
- Bug Fix PRs merged:
- https://github.com/vllm-project/vllm/pull/31173 Fix ‘CutlassMLAImpl’ object has no attribute ‘_workspace_buffer’
- https://github.com/vllm-project/vllm/pull/30823 Fix AttributeError: ‘ColumnParallelLinear’ object has no attribute weight_scale_inv
- https://github.com/vllm-project/vllm/pull/30820 Fix compressed tensor not using deepgemm
- … And more
- 40+ PR reviews
- Leading the effort to fix all mypy checks, tracked in issue #26533
- https://github.com/vllm-project/vllm/pull/30517 Fix mypy for vllm/v1/executor
Nov 25 - Dec 09
Batch Invariant:
- Landed: Optimize batch invariant BMM, 18.1% Throughput improvement, 10.7% TTFT improvement https://github.com/vllm-project/vllm/pull/29345
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
- Landed: Batch invariant: Enable TRITON_MLA without prefix-caching https://github.com/vllm-project/vllm/pull/29125
- Some other refactoring / bug fix / optimization PRs
Performance optimizations:
- Landed: Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement https://github.com/vllm-project/vllm/pull/30159
- Landed: Enable cuda graph for DeepEP HT, 5.3% throughput improvement, 4.4% TTFT improvement https://github.com/vllm-project/vllm/pull/29558
- Landed: Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. https://github.com/vllm-project/vllm/pull/29546
- Due to the similar model architecture, these optimizations also apply automatically to the DeepSeek model series.
vLLM Contributions:
- Refactoring PRs
- https://github.com/vllm-project/vllm/pull/29903 Log optimization
- WIP: https://github.com/vllm-project/vllm/pull/30282 Refactor for parallel_config in FusedMoEModularKernel
- https://github.com/vllm-project/vllm/pull/29897 Add env `VLLM_FLOAT32_MATMUL_PRECISION` to fix torch warning (see the sketch after this list)
- Bug Fix PRs merged:
- https://github.com/vllm-project/vllm/pull/29999 Fix the ‘vLLM config is not set’ issue that had bothered committers for almost two weeks
- https://github.com/vllm-project/vllm/pull/29973 Fix re import error
- 50+ PR reviews
- Leading the effort to fix all mypy checks, tracked in issue #26533
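On the `VLLM_FLOAT32_MATMUL_PRECISION` item above: the idea is to expose PyTorch's float32 matmul precision knob through an environment variable. A minimal sketch of that mapping (the parsing here is illustrative; the PR wires it through vLLM's own env handling):

```python
import os
import torch

precision = os.getenv("VLLM_FLOAT32_MATMUL_PRECISION", "highest")
if precision not in ("highest", "high", "medium"):
    raise ValueError(f"Unsupported VLLM_FLOAT32_MATMUL_PRECISION: {precision}")

# "highest" keeps full-precision FP32 matmuls; "high"/"medium" allow faster
# reduced-precision (e.g. TF32) matmuls on GPUs that support them.
torch.set_float32_matmul_precision(precision)
```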
Nov 13 - Nov 24
Batch Invariant:
- Under review: Optimize batch invariant BMM, 18.1% Throughput improvement, 10.7% TTFT improvement https://github.com/vllm-project/vllm/pull/29345
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
- Under review: Batch invariant: Enable TRITON_MLA without prefix-caching https://github.com/vllm-project/vllm/pull/29125
- Landed: add to CI https://github.com/vllm-project/vllm/pull/27842
- Several other bug fix / Optimizations PRs
CUDA fused MOE optimizations:
- Shared Experts Overlap with FI deepgemm swap kernel, 2.2% throughput improvement and 3.6% TTFT improvement https://github.com/vllm-project/vllm/pull/28879
- Landed Optimize select_experts https://github.com/vllm-project/vllm/pull/28069
- Several other PRs optimizations
vLLM Contributions:
- Refactoring PRs merged
- Bug Fix PRs merged:
- #29202 #29112 #29040 and several more
- Fix the torch dynamo warning ‘Dynamo detected a call to a functools.lru_cache’, making dynamo tracing much faster. https://github.com/vllm-project/vllm/pull/29038
- 50+ PR reviews
- Leading the effort to fix all mypy checks, tracked in issue #26533
- Fix mypy for vllm/v1/worker https://github.com/vllm-project/vllm/pull/29037
Oct 29 - Nov 12
Batch Invariant:
No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan:
- https://blog.vllm.ai/2025/11/10/bitwise-consistent-train-inference.html
- Authors: Bram Wasti, Wentao Ye, Teja Rao, Michael Goin, Paul Zhang, Tianyu Liu, Natalia Gimelshein, Woosuk Kwon, Kaichao You, Zhuohan Li
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433 (a minimal illustration of the batch-invariance property follows this list)
- WIP: Support DP + EP + FLASHINFER_MLA for R1 https://github.com/vllm-project/vllm/pull/27421
- WIP: add to CI https://github.com/vllm-project/vllm/pull/27842
- Fix torch.dynamo.exc.Unsupported: Logger not supported for non-export cases https://github.com/vllm-project/vllm/pull/27606
- Several other bug fix / optimization PRs
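A minimal illustration of the property all this batch-invariant work targets (hypothetical helper, not a vLLM API): a request must produce bitwise-identical output whether it runs alone or inside a larger batch.

```python
import torch

def check_batch_invariance(model, batch):
    # With ordinary GPU kernels this check often fails, because the reduction
    # order inside matmuls/attention changes with the batch shape; the
    # batch-invariant kernels make the equality hold exactly.
    alone = model(batch[:1])        # first request run by itself
    batched = model(batch)[:1]      # same request inside the full batch
    return torch.equal(alone, batched)   # bitwise equality, not allclose
```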
CUDA fused MOE optimizations:
- Enable TP + EP shared_experts overlap with router, 3.7% E2E performance improvement, 24% TTFT improvement https://github.com/vllm-project/vllm/pull/28164
- WIP: Optimize select_experts https://github.com/vllm-project/vllm/pull/28069
- Landed VLLM_MOE_USE_DEEP_GEMM support to split deepgemm / triton https://github.com/vllm-project/vllm/pull/28422
- Helped Eliza optimize the fused rms quant kernel (30ms -> 20ms, 50% faster) https://github.com/vllm-project/vllm/pull/27883#issuecomment-3505283883
Community Leadership:
- Lead implementation
- Mentioned and deep review
vLLM Contributions:
Oct 15 - Oct 28
Batch Invariant:
- Feature supported and announced! https://x.com/vllm_project/status/1981088861506982041
- Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
- Batch Invariant: Support DP + EP + FLASHINFER_MLA for R1 https://github.com/vllm-project/vllm/pull/27421
- Batch Invariant: Support DeepGEMM and Blackwell https://github.com/vllm-project/vllm/pull/27127
- Under review: Torch compile support https://github.com/vllm-project/vllm/pull/27660
- And several other supporting PRs
Customer-related Bug Fix:
- From Clayton (Google, llm-d):
- Ready to merge: Fix DeepEP low latency assert self.batched_router_logits.size(-1) == full_router_logits.size(-1) https://github.com/vllm-project/vllm/pull/27682
- Fix deepep low latency nvlink usage issue https://github.com/vllm-project/vllm/pull/27677
- Ready to merge: Fix DBO IMA issue for DeepEPHT https://github.com/vllm-project/vllm/pull/27666
- Fix shape issue for eplb expert weights https://github.com/vllm-project/vllm/pull/27589
Community Leadership:
- Lead implementation
- Mentioned and deep review
vLLM Contributions:
Oct 1 - Oct 14
Customer-related Bug Fix:
- From Clayton (llm-d):
- Image issue with DeepGEMM (‘no kernel image is available for execution on the device’): gave technical support and a fix within two days
- Log optimization #26322
- From Lu Fang (Meta)
- WIP: Improve vLLM CUDA Memory Utilization and Estimation #26300
Batch Invariant:
- Closely collaborated with Bram Wasti; milestone doc: vLLM Batch-Invariance Work List
- Landed Flashinfer support #26373
- WIP: Deepseek-v3 Batch Invariant on 8xH100 https://github.com/vllm-project/vllm/pull/26609
- Several other small PRs
Community Leadership:
- Lead implementation
- Vectorize RMS norm variance using vectorize_read_with_alignment #26234
- Fix all mypy checks, tracked in issue #26533
- Fix mypy for vllm/attention and vllm/compilation #26482
- Fix mypy for vllm/distributed #26593
- Fix mypy for vllm/engine and vllm/utils #26540
- Fix mypy for vllm/executor #26845
- Reduce Unit Test to Speed Up CI #22041
- Prune Tests in kernel/mamba #26538
- Pruning kernel Core Tests #26727
- Mentioned and deep review
- #26669: support flashinfer_fp4 moe for 5090 gpu
- #25619: Speedup DeepGEMM warmup with heuristics
- #26438: TypeError: argument ‘id’: StreamInput must be either an integer or a list of integers
- + a lot more
vLLM Contributions:
- Huge Performance Improvement
- Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement: #26197 (see the scale-rounding sketch after this list)
- Refactoring PRs merged
- Bug Fix PRs merged:
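On the E8M0 item above: DeepGEMM's E8M0 mode restricts per-block scales to powers of two (an exponent-only format), so FP32 scales have to be rounded up accordingly. A minimal sketch of that rounding, assuming the usual ceil-to-power-of-two convention (illustrative, not vLLM's exact helper):

```python
import torch

def round_scale_to_power_of_two(scale: torch.Tensor) -> torch.Tensor:
    # Round each scale up to the next power of two so it fits an
    # exponent-only (E8M0-style) representation.
    return torch.pow(2.0, torch.ceil(torch.log2(scale.clamp(min=1e-10))))

scales = torch.tensor([0.7, 1.0, 3.2])
print(round_scale_to_power_of_two(scales))  # tensor([1., 1., 4.])
```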
Sep 17 - Sep 30
DeepSeekV3.2 Support
- One week on a tight timeline, working through weekends, in close collaboration with Chen Zhang, Yongye Zhu, Kaichao You, etc.
- Main PR: #25896
- Release note: https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html
- Wentao Ye in the Acknowledgements!
- My Work (All PRs combined)
- Everything with DeepGEMM
- Wheels, test script, B200 validation
- Weight loading issues, etc., like #25909
Customer-related Bug Fix:
- From Clayton (llm-d):
- vLLM 0.11.0 release blocker
vLLM Contributions:
- Several Refactoring/Fix PRs merged: #25958 #25710 #25519 #25518 #25517 + several more
- Leadership:
- Mentioned by Community and Deep Review
Sep 3 - Sep 16
Performance Optimization:
- Optimize DeepGEMM scale Contiguous Layout
- https://github.com/vllm-project/vllm/pull/24783
- 5.5% Throughput Improvement
- Ready for review: Triton Kernel for per_block_cast_to_fp8, 6x faster
- https://github.com/vllm-project/vllm/pull/24611
- 6x faster than the torch version (a reference sketch of the per-block cast follows this list)
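For reference, what per_block_cast_to_fp8 computes, written as plain PyTorch: cast a weight matrix to FP8 with one scale per 128x128 block. A sketch assuming E4M3 and dimensions divisible by the block size; the Triton kernel in the PR replaces exactly this kind of torch-level version with something much faster:

```python
import torch

def per_block_cast_to_fp8_reference(w: torch.Tensor, block: int = 128):
    m, n = w.shape
    # View the matrix as (m/block, block, n/block, block) tiles.
    tiles = w.view(m // block, block, n // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    q = (tiles / scale).to(torch.float8_e4m3fn).view(m, n)
    return q, scale.squeeze(1).squeeze(-1)   # quantized weight + (m/block, n/block) scales

w = torch.randn(256, 256)
q, scales = per_block_cast_to_fp8_reference(w)
```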
Severe Bug Fix:
- Clayton’s torch compile cache issue: https://github.com/vllm-project/vllm/issues/24915
- Torch Inductor Graph issue:
DBO support
- The DBO PR landed: https://github.com/vllm-project/vllm/pull/23693 (worked together with Sage and Lucas)
- HT support for the DBO PR is ready for review (combined with Lucas’ prefill support) https://github.com/vllm-project/vllm/pull/24845
vLLM Contributions
- Several Refactoring/Fix PRs merged: #24902 #24887 #24774 #24696 #24674 + 4 other PRs
- Several fix for CI: #24259 #24670
- Reviewed 40+ PRs
Aug 20 - Sep 2
Model Support for DeepSeek V3.1:
- Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt
- https://github.com/vllm-project/vllm/pull/23666
Performance Optimization:
- Enable Piecewise CUDAGraph for DeepEP HT
- https://github.com/vllm-project/vllm/pull/24123
- 33% E2E Throughput improvement for Decode
- Enable DeepGEMM Linear on B200
- https://github.com/vllm-project/vllm/pull/23351
- 1.5% E2E throughput improvement
Severe Bug Fix
- R1 Accuracy issue: routed_scaling_factor applied twice (double mul); see the sketch after this list
- https://github.com/vllm-project/vllm/pull/24119
- Meta is deploying from vLLM main
- Meta reached out to express gratitude for the fast fix
- Full Cuda graph Hang issue
- https://github.com/vllm-project/vllm/pull/23595
- Temporary fix; will do more exploration later
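On the routed_scaling_factor item above, an illustrative reproduction of the bug pattern (names are hypothetical, not vLLM's actual code): the factor was applied both when producing the top-k router weights and again when combining expert outputs, so activations ended up scaled by the factor twice.

```python
import torch

routed_scaling_factor = 2.5
topk_weights = torch.softmax(torch.randn(4, 8), dim=-1)

weights_from_router = topk_weights * routed_scaling_factor           # applied once here ...
buggy_combine_scale = weights_from_router * routed_scaling_factor    # ... and again here

correct_scale = topk_weights * routed_scaling_factor                 # applied exactly once
print(torch.allclose(buggy_combine_scale, correct_scale * routed_scaling_factor))  # True
```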
DBO support
- https://github.com/vllm-project/vllm/pull/23693 (worked together with Sage and Lucas)
- HT single handle issue fixed
vLLM Contributions
Aug 6 - Aug 19
I was nominated to be a vLLM committer! Thanks so much to Kaichao You, Michael Goin, Robert Shaw, Taneem Ibrahim, Yuan Tang, and the vLLM community!
https://github.com/vllm-project/vllm/pull/22741
B200 Performance Optimization:
- Cutlass MLA full cuda graph support
- https://github.com/vllm-project/vllm/pull/22763
- Also needed for DBO
- 6% E2E Throughput Improvement
- Bug fix for FusedMoEModularKernel #22757
DBO support:
- Several bugs fixed
- Fix set forward context error
- Fix assert error num_tokens_across_dp is None
- Fix ubatch datatype issue
- Fix R1 accuracy issue
- Built on a B200 system; it is easy to benchmark now
vLLM Contributions:
July 24 – Aug 5
B200 Performance Optimization:
- Per-token-group quant CUDA kernel (a reference sketch follows this list)
#21476 — 15× faster than the original Triton kernel (int8).
#21867 — using __nv_fp8_e4m3, 10% faster for FP8.
Works on all NVIDIA architectures, not only B200.
- NVFP4 optimization
Bug fix for Compressed Tensor NVFP4: #21465
Add FlashInfer MoE support for Compressed Tensor NVFP4: #21639 — ~15% E2E throughput.
- Other perf wins
Non-contiguous support for FP8 quantization: #21961 — ~1% E2E throughput.
Optimize reshape_and_cache_flash CUDA kernel: #22036 — 20–40% faster.
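For reference, the operation those per-token-group quant kernels accelerate, as plain PyTorch: quantize activations to FP8 with one scale per group of 128 channels within each token. A sketch assuming E4M3 and a hidden size divisible by the group size (illustrative, not the vLLM implementation):

```python
import torch

def per_token_group_quant_fp8_reference(x: torch.Tensor, group_size: int = 128):
    n_tokens, hidden = x.shape
    groups = x.view(n_tokens, hidden // group_size, group_size)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    q = (groups / scale).to(torch.float8_e4m3fn).view(n_tokens, hidden)
    return q, scale.squeeze(-1)   # quantized activations + per-group scales

x = torch.randn(8, 4096)
q, scales = per_token_group_quant_fp8_reference(x)
```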
B200 New DeepGemm Integration:
✅ Done for this large scope! Special thanks to the help from Kaichao You and Chenggang Zhao.
DBO Support:
- WIP: Collaborated with Sage and Lucas — exciting new scope.
Other Contributions:
July 9 – July 23
B200 Performance Optimization:
- Per-token-group quant CUDA kernel for FP8:
#21083 — ~6% E2E improvement; works on all NVIDIA architectures.
- WIP at the time: per-token-group quant CUDA kernel for int8 (later landed as #21476).
- NVFP4 optimization:
Bug fix for Compressed Tensor NVFP4 (ready to review then): #21465
B200 New DeepGemm Integration:
- Merged support for breaking DeepGEMM update on B200: #20087
Upstream DeepGEMM PR: deepseek-ai/DeepGEMM #112
- Follow-up optimizations (all merged):
DeepEP low-latency bugfix: #20833
~15% E2E perf improvement: #20841
Breaking change fix: #21187
CUDA init error fix due to DeepGemm: #21312
CI Bug Fixes:
Other Contributions:
June 23 – July 8
B200 Performance Optimization:
- Quant vectorization utils optimization: #20331
+3% E2E for CUDA quant kernels; reusable for FP8 quant, reshape_and_cache_flash, etc.
B200 New DeepGemm Integration:
- WIP then: support new breaking DeepGEMM for B200: #20087
~40% perf improvement for the GEMM kernel at specific batch sizes.
Special thanks to Michael Goin and Varun Sundar Rabindranath.
B200 DeepEP & PPLX Validation:
- Bug fix: #20094 — validation done.
Severe CI Bug — Fixed:
Other Contributions:
June 9 – June 20
B200 Performance Optimization:
- align_moe_block_size kernel optimization: #19572 — ~6% E2E throughput (a reference sketch of the alignment idea follows this list).
- Benchmark script refactor for GEMM: #19627 — made future quant benchmarking easier.
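On the alignment kernel above: the point is to pad each expert's token count up to a multiple of the block size so the grouped expert GEMM can work on fixed-size blocks (padding slots are masked out later). A tiny sketch of that counting/padding step, illustrative only, not the kernel itself:

```python
import torch

def align_expert_token_counts(topk_ids: torch.Tensor, block_size: int, num_experts: int):
    # How many tokens each expert received, and the padded (block-aligned) counts.
    counts = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    padded = ((counts + block_size - 1) // block_size) * block_size
    return counts, padded

topk_ids = torch.randint(0, 8, (16, 2))   # 16 tokens routed to their top-2 of 8 experts
counts, padded = align_expert_token_counts(topk_ids, block_size=16, num_experts=8)
```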
B200 DeepGemm Integration:
- Initial integration: #19820 — ~40% GEMM perf improvement.
Thanks to Robert Shaw!
B200 DeepEP Integration:
- Env setup & initial exploration.
Other Contributions:
- Helped review several PRs.
June 2 – June 7
B200 Performance Optimization:
- Int8 quant kernel optimization: #19233 — ~10% E2E throughput on B200.
Thanks to Michael Goin’s guidance! My first vLLM PR!
Other Contributions:
- Raised issues and reviewed several PRs.