Bi-weekly Journal: Contributions to vLLM (2026)

yewentao included in category Vllm

2026-05-30 2026-07-04 2358 words 11 minutes

Summary

My bi-weekly journal of contributions to vLLM.

All contributions: https://github.com/vllm-project/vllm/graphs/contributors
All PR reviews: https://github.com/vllm-project/vllm/pulls?q=is%3Apr+is%3Aopen+reviewed-by%3Ayewentao256+

June 10 - June 23

GPU Model Runner V2 Maintenance:

Migration from v1 to v2: https://github.com/vllm-project/vllm/issues/41286
Landed: https://github.com/vllm-project/vllm/pull/42759 Migrate Reset cache for both v2 and v1 model runner
Landed: https://github.com/vllm-project/vllm/pull/42667 Migration from v1 to v2, with Qwen and DSv2 MOE models [3/N]
Under review: https://github.com/vllm-project/vllm/pull/44443 Enable all dense models for mrv2 [4/N]
- Landed: https://github.com/vllm-project/vllm/pull/44568 Fix v2 AttributeError: 'CohereASRDecoder' object has no attribute 'embed_input_ids'
- Landed: https://github.com/vllm-project/vllm/pull/46095 Fix MRv2 memory leak test
- Landed: https://github.com/vllm-project/vllm/pull/45467 Fix openai.InternalServerError: Error code: 500 - 'list index out of range'
Landed: https://github.com/vllm-project/vllm/pull/44446 Migration to support quantized model by default [5/N]
Landed: https://github.com/vllm-project/vllm/pull/45461 Enable GraniteMOE for MRv2 by default
Under review: https://github.com/vllm-project/vllm/pull/46646 Enable all moe models for MRv2

Large Scaling Serving

Under review: https://github.com/vllm-project/vllm/pull/44915 [Feature] Migrate DP Supervisor from Python to Rust

DeepSeek V4

Performance Tracking issue: https://github.com/vllm-project/vllm/issues/45861
Landed: https://github.com/vllm-project/vllm/pull/45061 Optimize DSv4 prefill chunk planning, 4.0% E2E Throughput Improvement
Landed: https://github.com/vllm-project/vllm/pull/45052 [Bug] Fix test flashmla for DSv4
Landed: https://github.com/vllm-project/vllm/pull/45863 DSv4 flashinfer sparse index cache for metadata, 2%~4% TTFT improvement

GLM 5.2

Performance tracking issue: https://github.com/vllm-project/vllm/issues/46654
Under review: https://github.com/vllm-project/vllm/pull/46635 Replace MOE all-reduce with reduce-scatter, 3.1%~3.2% E2E Throughput improvement
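
For readers less familiar with the two collectives named in that PR title, here is a small pure-Python simulation of how they relate. It is only a sketch under my own assumptions: there is no real communication, the function names are hypothetical, and it is not the code in the PR. It shows that a reduce-scatter leaves each rank with one shard of the summed result, and that an all-gather over those shards reproduces what an all-reduce would have returned.

```python
# Pure-Python simulation of collectives over a list of per-rank vectors.
# Illustrative only: no real communication library, all names hypothetical.

def all_reduce(rank_inputs):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*rank_inputs)]
    return [list(total) for _ in rank_inputs]


def reduce_scatter(rank_inputs):
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    world = len(rank_inputs)
    total = [sum(vals) for vals in zip(*rank_inputs)]
    shard = len(total) // world  # assumes length is divisible by world size
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(rank_shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in rank_shards for x in shard]
    return [list(full) for _ in rank_shards]


# Four simulated ranks, each holding a partial output of 8 values.
inputs = [[float(r + i) for i in range(8)] for r in range(4)]
reduced = all_reduce(inputs)
scattered = reduce_scatter(inputs)
```

In this toy setup each rank holds 2 values after the reduce-scatter instead of 8, which is the general reason the swap can reduce communication volume when the full result is not needed on every rank right away.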

Kernel Optimization

Under review: https://github.com/vllm-project/vllm/pull/43137 Optimize per_token_group_quant using register directly, 4.5% E2E Throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/44572 [Perf] SM90 cutlass fp8 mm supports odd M by swap_ab, 180~290% kernel performance improvement
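
A note on the swap_ab idea, as a sketch rather than the kernel itself: I am assuming here that swap_ab refers to the usual transpose identity A @ B = (B^T @ A^T)^T, which lets a GEMM run with the roles of M and N exchanged. The pure-Python example below (hypothetical helper names, no CUTLASS involved) only checks that identity on a problem with an odd M.

```python
def matmul(a, b):
    """Naive product of a (M x K) and b (K x N), both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def matmul_swap_ab(a, b):
    """Compute a @ b as (b^T @ a^T)^T, so M and N trade places."""
    return transpose(matmul(transpose(b), transpose(a)))


a = [[1, 2], [3, 4], [5, 6]]       # M = 3 (odd), K = 2
b = [[1, 0, 2, 1], [0, 1, 1, 3]]   # K = 2, N = 4
direct = matmul(a, b)
swapped = matmul_swap_ab(a, b)
```

Both paths give the same 3 x 4 result; in the swapped form the odd dimension sits on the N side of the inner product instead of the M side.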

vLLM Contributions

May 27 - June 09

GPU Model Runner V2

Migration from v1 to v2: https://github.com/vllm-project/vllm/issues/41286
Landed: https://github.com/vllm-project/vllm/pull/42665 Migration from v1 to v2, with more Llama and Mistral dense models [2/N]
- Under review: https://github.com/vllm-project/vllm/pull/42759 Migrate Reset cache for both v2 and v1 model runner
Under review: https://github.com/vllm-project/vllm/pull/42667 Migration from v1 to v2, with Qwen and DSv2 MOE models [3/N]
- Under review: https://github.com/vllm-project/vllm/pull/43915 Feature: Support ElasticEPScalingExecutor for MRv2
Under review: https://github.com/vllm-project/vllm/pull/44443 Enable all dense models for mrv2 [4/N]
- Landed: https://github.com/vllm-project/vllm/pull/44450 Fix mrv2 mm lora issue
- Under review: https://github.com/vllm-project/vllm/pull/44568 Fix v2 AttributeError: 'CohereASRDecoder' object has no attribute 'embed_input_ids'
Under review: https://github.com/vllm-project/vllm/pull/44446 Migration to support quantized model by default [5/N]

Large Scaling Serving

Landed: https://github.com/vllm-project/vllm/pull/43707 Optimize shutdown logs, easier to follow and consistent
Landed: https://github.com/vllm-project/vllm/pull/43688 [Feature] SSL support for dp supervisor
Under review: https://github.com/vllm-project/vllm/pull/44915 [Feature] Migrate DP Supervisor from Python to Rust

DeepSeek V4

Under review: https://github.com/vllm-project/vllm/pull/45061 Optimize DSv4 prefill chunk planning, 4.0% E2E Throughput Improvement
Under review: https://github.com/vllm-project/vllm/pull/45052 [Bug] Fix test flashmla for DSv4
Landed: https://github.com/vllm-project/vllm/pull/44914 [Bug] Fix deepseek v4 OOM issue

Kernel Optimization

Landed: https://github.com/vllm-project/vllm/pull/43706 Optimize cutlass fp8 scaled mm bypassing padding, 20% kernel performance improvement
Under review: http://github.com/vllm-project/vllm/pull/43349 [Perf] Optimize remapped greedy draft token selection for Eagle3 and DFlash, 37~81% kernel performance improvement
Under review: https://github.com/vllm-project/vllm/pull/43137 Optimize per_token_group_quant using register directly, 4.5% E2E Throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/43014 Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement
Under review: https://github.com/vllm-project/vllm/pull/44572 [Perf] SM90 cutlass fp8 mm supports odd M by swap_ab, 180~290% kernel performance improvement

vLLM Contributions

May 13 - May 26

GPU Model Runner V2:

Migration from v1 to v2: https://github.com/vllm-project/vllm/issues/41286
Landed: https://github.com/vllm-project/vllm/pull/39337 Oracle for model runner v2 - dense model by default [1/N]
- Landed: https://github.com/vllm-project/vllm/pull/41761 Bug fix: logprob dtype int64/int32 issue
Under review: https://github.com/vllm-project/vllm/pull/42665 Migration from v1 to v2, with more Llama and Mistral dense models [2/N]
- Under review: https://github.com/vllm-project/vllm/pull/42759 Migrate Reset cache for both v2 and v1 model runner
- Landed: https://github.com/vllm-project/vllm/pull/43233 Force v1 runner for tests
- Landed: https://github.com/vllm-project/vllm/pull/43139 Fix lora Triton Error [CUDA]: device-side assert triggered
- Landed: https://github.com/vllm-project/vllm/pull/42778 Fix prompt logprobs calculation Sizes of tensors must match error
- Landed: https://github.com/vllm-project/vllm/pull/42676 Fix kv_connector pre_forward order
- Landed: https://github.com/vllm-project/vllm/pull/42673 Support reload weights (sleep mode)
Under review: https://github.com/vllm-project/vllm/pull/42667 Migration from v1 to v2, with Qwen and DSv2 MOE models [3/N]

Large Scaling Serving

Landed: https://github.com/vllm-project/vllm/pull/40841 Support a node-local external DP mode with a single vllm serve and aggregated admin health endpoint
Under review: https://github.com/vllm-project/vllm/pull/43707 Optimize shutdown logs, easier to follow and consistent
Under review: https://github.com/vllm-project/vllm/pull/43688 [Feature] SSL support for dp supervisor

Kernel Optimization

Under review: https://github.com/vllm-project/vllm/pull/43706 Optimize cutlass fp8 scaled mm bypassing padding, 20% kernel performance improvement
Under review: http://github.com/vllm-project/vllm/pull/43349 [Perf] Optimize remapped greedy draft token selection for Eagle3 and DFlash, 37~81% kernel performance improvement
Under review: https://github.com/vllm-project/vllm/pull/43137 Optimize per_token_group_quant using register directly, 4.5% E2E Throughput improvement
Under review: https://github.com/vllm-project/vllm/pull/43014 Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement
Landed: https://github.com/vllm-project/vllm/pull/42988 [Perf] zeros -> empty to remove additional fill
Landed: https://github.com/vllm-project/vllm/pull/42774 [Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7% e2e performance improvement
Landed: https://github.com/vllm-project/vllm/pull/42651 [Perf] Optimize CutlassFP8ScaledMMLinearKernel when padding needed by pre-weight processing, 13.5% TTFT improvement
Landed: https://github.com/vllm-project/vllm/pull/42561 [Perf] Optimize MLA attention _v_up_proj bmm by removing additional copy

Batch Invariant

Landed: https://github.com/vllm-project/vllm/pull/42456 Support compile mode for batch invariance on SM80
Under review: https://github.com/vllm-project/vllm/pull/42453 [Feature] Support batch invariant rms norm with residual

vLLM Contributions

April 29 - May 12

GPU Model Runner V2:

Migration from v1 to v2: https://github.com/vllm-project/vllm/issues/41286
Under review: https://github.com/vllm-project/vllm/pull/39337 Oracle for model runner v2 - dense model by default [1/N]
- Landed: https://github.com/vllm-project/vllm/pull/40559 Add logprob_token_ids support
- Landed: https://github.com/vllm-project/vllm/pull/41285 Fix v2 compile counter num_gpu_runner_capture_triggers and num_cudagraph_captured
- Landed: https://github.com/vllm-project/vllm/pull/41761 Bug fix: logprob dtype int64/int32 issue
- Under review: https://github.com/vllm-project/vllm/pull/41667 Support stock torch compile for v2

Large Scaling Serving

Under review: https://github.com/vllm-project/vllm/pull/40841 Support a node-local external DP mode with a single vllm serve and aggregated admin health endpoint
Landed: https://github.com/vllm-project/vllm/pull/40839 Fix status update address for non-MOE model within external dp mode
Landed: https://github.com/vllm-project/vllm/pull/39832 [KV Connector] Remove compat support for pre-v0.12.0 constructor signatures without KVCacheConfig
Landed: https://github.com/vllm-project/vllm/pull/42460 [Perf] Optimize MLA compute_prefill_context memory allocation

Batch Invariant

Landed: https://github.com/vllm-project/vllm/pull/40408 Batch invariance with Cutlass fp8 support, 28.9% E2E latency improvement
Under review: https://github.com/vllm-project/vllm/pull/42456 Support compile mode for batch invariance on SM80
Under review: https://github.com/vllm-project/vllm/pull/42453 [Feature] Support batch invariant rms norm with residual

Pooling Model (Performance issue completed https://github.com/vllm-project/vllm/issues/35631)

Landed: https://github.com/vllm-project/vllm/pull/41163 Optimize AllPool.forward by slicing first, 51% faster in the method level benchmark

vLLM Contributions

April 15 - April 28

GPU Model Runner V2:

Under review: https://github.com/vllm-project/vllm/pull/35214 Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement
- Under review: https://github.com/vllm-project/vllm/pull/39337 Oracle for model runner v2 - dense model by default [1/N]
- Landed: https://github.com/vllm-project/vllm/pull/40648 Fix block table IMA issue
- Under review: https://github.com/vllm-project/vllm/pull/40559 Add logprob_token_ids support
- Landed: https://github.com/vllm-project/vllm/pull/39937 Multiple prompt logprobs support

Large Scaling Serving

Under review: https://github.com/vllm-project/vllm/pull/38850 Optimize DCP communication with reusable collective scratch buffers, 1.5%~4% E2E throughput improvement
Under review: https://github.com/vllm-project/vllm/pull/40841 Support a node-local external DP mode with a single vllm serve and aggregated admin health endpoint
Under review: https://github.com/vllm-project/vllm/pull/40839 Fix status update address for non-MOE model within external dp mode
Under review: https://github.com/vllm-project/vllm/pull/40174 Add fast all2all kernel, tested for DCP, 1.1% Throughput improvement
Under review: https://github.com/vllm-project/vllm/pull/39832 [KV Connector] Remove compat support for pre-v0.12.0 constructor signatures without KVCacheConfig

Batch Invariant

Under review: https://github.com/vllm-project/vllm/pull/40408 Batch invariance with Cutlass fp8 support, 28.9% E2E latency improvement
Landed: https://github.com/vllm-project/vllm/pull/40413 Optimize batch invariant with fused rms norm, 2.1% E2E latency improvement
Landed: https://github.com/vllm-project/vllm/pull/39820 Fix batch invariance nvfp4 support

Pooling Model (issue https://github.com/vllm-project/vllm/issues/35631)

Under review: https://github.com/vllm-project/vllm/pull/41163 Optimize AllPool.forward by slicing first, 51% faster in the method level benchmark

vLLM Contributions

April 1 - April 14

GPU Model Runner V2:

Under review: https://github.com/vllm-project/vllm/pull/38390 E/P/D disaggregation support
- Under review: https://github.com/vllm-project/vllm/pull/35214 Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement
- Under review: https://github.com/vllm-project/vllm/pull/35206 Support Sequence Parallel for Model Runner v2 (Piecewise Cudagraph, PP=1)
- Under review: https://github.com/vllm-project/vllm/pull/39337 Oracle for model runner v2 - dense model by default [1/N]
- Landed: https://github.com/vllm-project/vllm/pull/39353 Fix flex attention kv blocks calculation issue

Large Scaling Serving

Under review: https://github.com/vllm-project/vllm/pull/38850 Optimize DCP communication with reusable collective scratch buffers, 1.5%~4% E2E throughput improvement

Batch Invariant

Landed: https://github.com/vllm-project/vllm/pull/39320 Fix batch invariant test issue, bs=1 with max_seq_num = 1
Landed: https://github.com/vllm-project/vllm/pull/39322 batch invariance nvfp4 support linear
Guiding the community to work on batch invariance, e.g. https://github.com/vllm-project/vllm/pull/39912
And more…

Pooling Model (issue https://github.com/vllm-project/vllm/issues/35631)

Under review: https://github.com/vllm-project/vllm/pull/39533 Batched projector for pooling model embed, 1.8% throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/39113 Optimize redundant sync for pooling model, 3.7% Throughput Improvement

vLLM Contributions

March 18 - March 31

News: https://vllm.ai/blog/mrv2 first released!

GPU Model Runner V2:

Under review: https://github.com/vllm-project/vllm/pull/38390 E/P/D disaggregation support
- Under review: https://github.com/vllm-project/vllm/pull/35214 Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement
- Under review: https://github.com/vllm-project/vllm/pull/35206 Support Sequence Parallel for Model Runner v2 (Piecewise Cudagraph, PP=1)
- Landed: https://github.com/vllm-project/vllm/pull/37488 EPLB Support for GPU Model Runner v2

Large Scaling Serving

Under review: https://github.com/vllm-project/vllm/pull/38287 Skip kv connector empty work, Around 1% Throughput Improvement
Landed: https://github.com/vllm-project/vllm/pull/38383 Remove dead code in kv connector and model runner

Batch Invariant

Under review: https://github.com/vllm-project/vllm/pull/38039 Fix batch invariance for offline serving and aux stream
Landed: https://github.com/vllm-project/vllm/pull/38014 Add batch invariant test for b200
Landed: https://github.com/vllm-project/vllm/pull/37895 Add batch invariant test: Block FP8 + small MOE
Landed: https://github.com/vllm-project/vllm/pull/37718 Fix fp8 deepgemm batch invariant
And more…

Kernel Optimization

Landed: https://github.com/vllm-project/vllm/pull/37340 Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement

Pooling Model (issue https://github.com/vllm-project/vllm/issues/35631)

Landed: https://github.com/vllm-project/vllm/pull/37347 Optimize token_embed for pooling models, 1.0% token throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/38559 Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/38139 Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement
And more…
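
To make the index_add idea from the mean pooling item concrete, here is a minimal pure-Python sketch. It is my own illustration under stated assumptions, not the PR's code: the function name and arguments are hypothetical, the chunking part is left out, and every sequence is assumed to own at least one token. Each token's embedding is accumulated into the row of the sequence it belongs to, then divided by that sequence's token count.

```python
def mean_pool_index_add(token_embeds, seq_ids, num_seqs):
    """Mean-pool flattened token embeddings per sequence.

    token_embeds: list of per-token vectors for the whole batch.
    seq_ids:      seq_ids[i] is the sequence index that owns token i.
    num_seqs:     number of sequences (each must have at least one token).
    """
    dim = len(token_embeds[0])
    sums = [[0.0] * dim for _ in range(num_seqs)]
    counts = [0] * num_seqs
    # Scatter-add: the same accumulation an index_add along dim 0 performs.
    for vec, sid in zip(token_embeds, seq_ids):
        counts[sid] += 1
        row = sums[sid]
        for j, value in enumerate(vec):
            row[j] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]


tokens = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
pooled = mean_pool_index_add(tokens, seq_ids=[0, 0, 1], num_seqs=2)
```

The point of this formulation is that the whole batch is pooled in one pass over the flattened tokens instead of one slice-and-mean per sequence.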

vLLM Contributions

Refactoring PRs
Bug Fix PRs:

March 3 - March 17

GPU Model Runner V2:

Under review: https://github.com/vllm-project/vllm/pull/35214 Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement
- Under review: https://github.com/vllm-project/vllm/pull/35206 Support Sequence Parallel for Model Runner v2 (Piecewise Cudagraph, PP=1)
- Under review: https://github.com/vllm-project/vllm/pull/37195 [V0 Deprecation] Deprecate virtual engine

Large Scaling Serving:

Landed: https://github.com/vllm-project/vllm/pull/35781 Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement
Landed: https://github.com/vllm-project/vllm/pull/36424 Remove dead code in KV connector
Landed: https://github.com/vllm-project/vllm/pull/36170 Remove default ray dependency

Kernel Optimization:

Under review: https://github.com/vllm-project/vllm/pull/37340 Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement

Pooling Model:

Under review: https://github.com/vllm-project/vllm/pull/37347 Optimize token_embed for pooling models, 2.8% token throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/36710 Optimize compute maxsim using batched version, 3.2% E2E throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/36159 Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/35427 Fix maxsim cuda platform and add cli to control it

Other Contributions:

Refactoring PRs
Bug Fix PRs:

Feb 18 - March 2

GPU Model Runner V2:

https://github.com/vllm-project/vllm/pull/35333 Optimize model runner v2 prepare_inputs copy logic, 6.1% E2E throughput improvement. Nick has a follow-up PR: https://github.com/vllm-project/vllm/pull/35561
https://github.com/vllm-project/vllm/pull/35214 Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement
https://github.com/vllm-project/vllm/pull/35206 Support Sequence Parallel for Model Runner v2 (Piecewise Cudagraph, PP=1)
https://github.com/vllm-project/vllm/pull/34903 Fix illegal memory access issue for model runner v2

Async Scheduling:

Under review: https://github.com/vllm-project/vllm/pull/34029 Optimize async scheduling redundant copy, 0.9% E2E throughput improvement
Under review: Optimize sampled_token_ids using numpy and remove tolist, 0.9% E2E throughput improvement https://github.com/vllm-project/vllm/pull/35446

Large Scaling Serving:

Under review: https://github.com/vllm-project/vllm/pull/35781 Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement

Pooling Model:

Landed: https://github.com/vllm-project/vllm/pull/35427 [Refactor] Fix maxsim cuda platform and add cli to control it
Landed: https://github.com/vllm-project/vllm/pull/35330 Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/35127 Optimize pooling model redundant copy, 1.8% throughput improvement

Other Contributions:

Refactoring PRs
Bug Fix PRs:
- https://github.com/vllm-project/vllm/pull/35314
- https://github.com/vllm-project/vllm/pull/34961
Leading the fix for all mypy checks, tracked in issue #26533
Batch invariant: Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433

Feb 4 - Feb 17

GPU Model Runner V2:

Landed: https://github.com/vllm-project/vllm/pull/34179 [Feature] Decode Context Parallel support for GPU model runner v2
- Co-authored with Summer and landed: https://github.com/vllm-project/vllm/pull/33960 Pipeline Parallel support for Model Runner V2 (git diff shared)

Async Scheduling:

Under review: https://github.com/vllm-project/vllm/pull/34029 Optimize async scheduling redundant copy, 0.9% E2E throughput improvement
Landed: https://github.com/vllm-project/vllm/pull/32975 Optimize detokenizer python logic
Landed: https://github.com/vllm-project/vllm/pull/33612 Optimize spec decoding + async scheduling, 1.5% Throughput improvement

Performance optimizations:

Landed: https://github.com/vllm-project/vllm/pull/33449 Remove align block size logic in moe_permute
Landed: https://github.com/vllm-project/vllm/pull/33368 Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement

Batch Invariant:

Lead and track all progress in https://github.com/vllm-project/vllm/issues/27433
Leading community contributions

Other Contributions:

Refactoring PRs
Bug Fix PRs:
- https://github.com/vllm-project/vllm/pull/33998

Jan 21 - Feb 3

Async Scheduling:

Done: [Feature]: Async Scheduling + Pipeline Parallel Support (V1) https://github.com/vllm-project/vllm/issues/32701
- Landed: https://github.com/vllm-project/vllm/pull/32618 Full support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement
Other Optimizations
- Under review: https://github.com/vllm-project/vllm/pull/32975 Optimize detokenizer python logic
- Under review: https://github.com/vllm-project/vllm/pull/33612 Optimize spec decoding + async scheduling, 1.5% Throughput improvement