Bi-weekly Journal: Contributions to vLLM

My bi-weekly journal of contributions to vLLM.

Current data points: 140+ commits merged into the main branch, 400+ PR reviews.


Customer-related Bug Fixes:

  • From Clayton (llm-d):
    • DeepGEMM kernel-image issue ("no kernel image is available for execution on the device"): provided technical support and a fix within two days (see the sketch after this list)
    • Log optimization: #26322
  • From Lu Fang (Meta):
    • WIP: Improve vLLM CUDA Memory Utilization and Estimation #26300
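
For context on this failure class, a minimal CUDA sketch (illustrative only, not the actual fix): the error appears when the loaded binary carries no SASS/PTX compiled for the GPU's SM architecture, so the first step is confirming what the device reports.

  // Illustrative sketch: how "no kernel image is available for execution
  // on the device" typically surfaces, e.g. a wheel built without code
  // for the device's architecture.
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void probe() {}  // trivial kernel forces an image lookup

  int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("GPU %s reports sm_%d%d\n", prop.name, prop.major, prop.minor);

    probe<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaErrorNoKernelImageForDevice) {
      // Rebuild with a matching -gencode (e.g. arch=compute_90a,code=sm_90a).
      printf("no kernel image for this device: %s\n", cudaGetErrorString(err));
    }
    return 0;
  }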

Batch Invariant:

Community Leadership:

  • Led implementation
    • Vectorize RMS norm variance using vectorize_read_with_alignment #26234 (see the sketch after this list)
    • Fix all mypy checks (tracking issue #26533)
      • [CI] Fix mypy for vllm/attention and vllm/compilation #26482
      • [CI] Fix mypy for vllm/distributed #26593
      • [CI] Fix mypy for vllm/engine and vllm/utils #26540
      • [CI] Fix mypy for vllm/executor #26845
    • Reduce Unit Test to Speed Up CI #22041
      • [CI][Perf] Prune Tests in kernel/mamba #26538
      • Pruning kernel Core Tests #26727
  • Mentioned by the community and deep reviews
    • #26669: support flashinfer_fp4 MoE for the 5090 GPU
    • #25619: [UX] Speedup DeepGEMM warmup with heuristics
    • #26438: [Bug]: TypeError: argument 'id': StreamInput must be either an integer or a list of integers
    • plus many more
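
A minimal sketch of the vectorization idea behind the RMS-norm PR above (#26234), assuming a float input row; vLLM's actual vectorize_read_with_alignment helper differs in signature and handles more dtypes:

  // Sketch: accumulate the RMS-norm variance term (sum of squares) with
  // 128-bit float4 loads when the row pointer is 16-byte aligned, falling
  // back to scalar reads otherwise.
  #include <cstdint>
  #include <cuda_runtime.h>

  __device__ float row_sum_squares(const float* __restrict__ row, int n) {
    float acc = 0.f;
    if (reinterpret_cast<uintptr_t>(row) % 16 == 0 && n % 4 == 0) {
      // Fast path: each thread reads four floats per 128-bit transaction.
      const float4* row4 = reinterpret_cast<const float4*>(row);
      for (int i = threadIdx.x; i < n / 4; i += blockDim.x) {
        float4 v = row4[i];
        acc += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
      }
    } else {
      // Fallback: scalar reads, still coalesced across the block.
      for (int i = threadIdx.x; i < n; i += blockDim.x) acc += row[i] * row[i];
    }
    return acc;  // block-reduce across threads, then variance = sum / n
  }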

vLLM Contributions:

DeepSeek V3.2 Support

  • One week on a tight timeline, working through the weekend, in close collaboration with Chen Zhang, Yongye Zhu, Kaichao You, and others
  • Main PR: #25896
  • Release note: https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html
    • Wentao Ye in the Acknowledgements!
  • My Work (All PRs combined)
    • Everything DeepGEMM-related
    • Wheels, test script, B200 validation
    • Weight-loading fixes such as #25909

Customer-related Bug Fixes:

  • From Clayton (llm-d):
    • Under review: fix for negative CUDA memory usage: #25683
    • Fixed OOM issue: #25290
    • Fixed CUDA graph cache issue: #25093
  • vLLM 0.11.0 release blocker
    • Issue with Qwen3-VL on B200
    • Raised in #25582 and fixed by #25788, working closely with Roger Wang

vLLM Contributions:

  • Several refactoring/fix PRs merged: #25958, #25710, #25519, #25518, #25517, and several more
  • Leadership:
    • Guided the community to produce better code: #22602
    • Feature request to optimize the reshape_and_cache CUDA kernel: #25705 (see the sketch after this list)
    • Feature request to reduce unit tests in CI: #22041
  • Mentioned by the community and deep reviews
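
For the reshape_and_cache feature request above, a conceptual sketch of what this kernel family does; the signature and shapes below are simplified assumptions, not vLLM's actual kernel:

  // Sketch: scatter freshly computed K (or V) vectors into the paged KV
  // cache via a slot mapping; the optimization work targets making this
  // copy wider and faster. Real kernels also handle V, multiple heads,
  // and non-float dtypes.
  __global__ void reshape_and_cache_sketch(
      const float* __restrict__ key,          // [num_tokens, head_dim]
      float* __restrict__ key_cache,          // [num_slots, head_dim]
      const long* __restrict__ slot_mapping,  // [num_tokens]
      int head_dim) {
    int token = blockIdx.x;
    long slot = slot_mapping[token];
    if (slot < 0) return;  // padded token: nothing to cache
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
      key_cache[slot * head_dim + d] = key[(long)token * head_dim + d];
    }
  }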

Performance Optimization:

Severe Bug Fix:

DBO Support:

vLLM Contributions:

Model Support for DeepSeek V3.1:

Performance Optimization:

Severe Bug Fix:

DBO Support:

vLLM Contributions:

I was nominated to be a vLLM committer! Thanks so much to Kaichao You, Michael Goin, Robert Shaw, Taneem Ibrahim, Yuan Tang, and the vLLM community!

https://github.com/vllm-project/vllm/pull/22741

B200 Performance Optimization:

DBO Support:

  • Several bugs fixed
    • Fixed a set_forward_context error
    • Fixed an assertion error: num_tokens_across_dp is None
    • Fixed a ubatch datatype issue
    • Fixed an R1 accuracy issue
  • Built on a B200 system; benchmarking is now easy

vLLM Contributions:

B200 Performance Optimization:

  • Per-token-group quant CUDA kernel (see the sketch after this list)
    #21476 — 15× faster than the original Triton kernel (int8).
    #21867 — using __nv_fp8_e4m3, 10% faster for FP8.
    Works on all NVIDIA architectures, not only B200.
  • NVFP4 optimization
    Bug fix for Compressed Tensor NVFP4: #21465
    Add FlashInfer MoE support for Compressed Tensor NVFP4: #21639 — ~15% E2E throughput.
  • Other perf wins
    Non-contiguous support for FP8 quantization: #21961 — ~1% E2E throughput.
    Optimize reshape_and_cache_flash CUDA kernel: #22036 — 20–40% faster.
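
A hedged sketch of the per-token-group quantization pattern behind #21476/#21867, with assumed shapes (one thread block per contiguous group); the merged kernels add vectorized loads and faster reductions:

  // Sketch: each group in a token's hidden dimension gets its own scale
  // = max|x| / 448 (448 is the largest finite e4m3 value), then elements
  // are cast to FP8. Launch with grid = num_tokens * groups_per_token.
  #include <cuda_fp8.h>

  __global__ void per_token_group_quant_fp8(
      const float* __restrict__ in,     // grouped input, row-major
      __nv_fp8_e4m3* __restrict__ out,  // same layout as in
      float* __restrict__ scales,       // one scale per group
      int group) {
    const float* g_in = in + (long)blockIdx.x * group;
    __nv_fp8_e4m3* g_out = out + (long)blockIdx.x * group;

    // Group-wide max|x|: |x| >= 0, so comparing the raw int bit patterns
    // preserves float ordering and atomicMax on the bits is valid.
    __shared__ int max_bits;
    if (threadIdx.x == 0) max_bits = 0;
    __syncthreads();
    float local = 0.f;
    for (int i = threadIdx.x; i < group; i += blockDim.x)
      local = fmaxf(local, fabsf(g_in[i]));
    atomicMax(&max_bits, __float_as_int(local));
    __syncthreads();

    float scale = fmaxf(__int_as_float(max_bits), 1e-10f) / 448.f;
    if (threadIdx.x == 0) scales[blockIdx.x] = scale;
    for (int i = threadIdx.x; i < group; i += blockDim.x)
      g_out[i] = __nv_fp8_e4m3(g_in[i] / scale);
  }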

B200 New DeepGEMM Integration:

DBO Support:

  • WIP: Collaborated with Sage and Lucas — exciting new scope.

Other Contributions:


B200 Performance Optimization:

  • Per-token-group quant CUDA kernel for FP8:
    #21083 — ~6% E2E improvement; works on all NVIDIA architectures.
  • WIP at the time: per-token-group quant CUDA kernel for int8 (later landed as #21476).
  • NVFP4 optimization:
    Bug fix for Compressed Tensor NVFP4 (ready to review then): #21465

B200 New DeepGEMM Integration:

  • Merged support for the breaking DeepGEMM update on B200: #20087
    Upstream DeepGEMM PR: deepseek-ai/DeepGEMM #112
  • Follow-up optimizations (all merged):
    DeepEP low-latency bugfix: #20833
    ~15% E2E perf improvement: #20841
    Breaking change fix: #21187
    Fix for a CUDA init error caused by DeepGEMM: #21312

CI Bug Fixes:

Other Contributions:

  • Code-refactoring PRs merged: #20770, #20774, and others
  • Reviewed 10+ PRs.

B200 Performance Optimization:

  • Quant vectorization utils optimization: #20331 (see the sketch below)
    +3% E2E for CUDA quant kernels; reusable for FP8 quant, reshape_and_cache_flash, etc.
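
A minimal sketch of the shared traversal pattern behind #20331, assuming float input; vLLM's real helper has a different signature and supports more dtypes:

  // Sketch: one read path that uses 128-bit float4 loads when the pointer
  // is 16-byte aligned and falls back to scalar reads otherwise, invoking
  // a per-element functor. Quant kernels and reshape_and_cache_flash can
  // then share the same traversal logic.
  #include <cstdint>
  #include <cuda_runtime.h>

  template <typename Func>
  __device__ void read_with_alignment(const float* __restrict__ p, int n,
                                      Func&& fn) {
    if (reinterpret_cast<uintptr_t>(p) % 16 == 0) {
      const float4* p4 = reinterpret_cast<const float4*>(p);
      for (int i = threadIdx.x; i < n / 4; i += blockDim.x) {
        float4 v = p4[i];
        fn(v.x); fn(v.y); fn(v.z); fn(v.w);
      }
      for (int i = n / 4 * 4 + threadIdx.x; i < n; i += blockDim.x)
        fn(p[i]);  // scalar tail when n is not a multiple of 4
    } else {
      for (int i = threadIdx.x; i < n; i += blockDim.x) fn(p[i]);
    }
  }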

B200 New DeepGEMM Integration:

B200 DeepEP & PPLX Validation:

  • Bug fix: #20094 — validation done.

Severe CI Bug — Fixed:

  • Issue raised (blocker in main for ~1 month): #20138
    Fixed in two days: #20204

Other Contributions:


B200 Performance Optimization:

  • moe_align_block_size kernel optimization: #19572 — ~6% E2E throughput (see the sketch after this list).
  • Benchmark script refactor for GEMM: #19627 — made future quant benchmarking easier.
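
For the moe_align_block_size item above, a tiny host-side sketch of the underlying math (illustrative only; the real op is a CUDA kernel that also produces sorted token indices):

  // Sketch: MoE grouped GEMMs pad each expert's token count up to a
  // multiple of the tile (block) size so every tile is full. This computes
  // the padded start offset of each expert's rows.
  #include <vector>

  std::vector<int> aligned_offsets(const std::vector<int>& tokens_per_expert,
                                   int block_size) {
    std::vector<int> offsets(tokens_per_expert.size() + 1, 0);
    for (size_t e = 0; e < tokens_per_expert.size(); ++e) {
      int padded = (tokens_per_expert[e] + block_size - 1)
                   / block_size * block_size;  // round up
      offsets[e + 1] = offsets[e] + padded;
    }
    return offsets;  // offsets[e] = first padded row of expert e
  }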

B200 DeepGEMM Integration:

B200 DeepEP Integration:

  • Env setup & initial exploration.

Other Contributions:

  • Helped review several PRs.

B200 Performance Optimization:

  • Int8 quant kernel optimization: #19233 — ~10% E2E throughput on B200 (see the sketch below).
    Thanks to Michael Goin’s guidance!
    My first vLLM PR!
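
A hedged sketch of dynamic per-token int8 quantization, the shape of the work in #19233 rather than its actual code:

  // Sketch: one block per token computes the row's max|x|, derives
  // scale = max|x| / 127, and rounds each element to int8.
  #include <cstdint>

  __global__ void per_token_quant_int8(const float* __restrict__ in,
                                       int8_t* __restrict__ out,
                                       float* __restrict__ scales,
                                       int hidden) {
    const float* row = in + (long)blockIdx.x * hidden;
    int8_t* row_out = out + (long)blockIdx.x * hidden;

    __shared__ int max_bits;  // |x| >= 0, so int bit-compare preserves order
    if (threadIdx.x == 0) max_bits = 0;
    __syncthreads();
    float local = 0.f;
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
      local = fmaxf(local, fabsf(row[i]));
    atomicMax(&max_bits, __float_as_int(local));
    __syncthreads();

    float scale = fmaxf(__int_as_float(max_bits), 1e-10f) / 127.f;
    if (threadIdx.x == 0) scales[blockIdx.x] = scale;
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
      row_out[i] = (int8_t)__float2int_rn(row[i] / scale);
  }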

Other Contributions:

  • Raised issues and reviewed several PRs.