About Me
Hello Friend, welcome to my blog!
Contact:
- Email: zhyanwentao@outlook.com
- GitHub: yewentao256
- LinkedIn: Wentao Ye
1. Patent
- One-iter Tool (CN117312173A), patented in 2023, reducing model accuracy validation time from hours to minutes.
2. Awards
- 100k Cornell Startup Award
- PolyRook: Fast 3D environment generation.
- National Third Prize
- China University Computer Contest, WeChat Big Data Challenge, 2021
- Rank 80 / 6,768 teams
- Certificate
- National Third Prize
- China University Computer Contest, Huawei Big Data Challenge, 2020
- Rank 27 / 1,491 teams
- Certificate
- National Second Prize
- China Service Outsourcing Innovation & Entrepreneurship Competition, 2020
- Top 1% / 6,417 teams
- Certificate
- National First Prize
- China University Computer Capability Challenge, 2019
- Runner-up / 414 teams
- Certificate
3. Experience
Machine Learning Engineer
- Red Hat
- Jun 2025 - Present
Deep Learning Engineer
- SenseTime | SenseCore
- Jul 2022 - Aug 2024
R&D Intern
- SenseTime | Research Institute (Deep Learning Frameworks)
- Jan 2021 - Jul 2022
Co-founder & CTO
- Wuhan Hongyuan Investment & Technology Services Co., Ltd.
- Nov 2019 - Sep 2020
Co-founder
- Yuye (Wuhan) Technology Development Co., Ltd.
- Jun 2019 - Nov 2019
4. Education
Master of Computer Science
- Cornell University | New York, USA
- May 2025
- GPA: 4.21/4.0 (A+ counted as 4.3)
Bachelor of Software Engineering
- Wuhan University | Wuhan, China
- Jun 2022
- GPA: 3.91/4.0
- National Scholarship (top 1%)
5. Selected Open-Source Projects
Contributor
PyTorch
- May 2023 - Present
- Optimized the cuDNN Convolution and cuDNN BatchNorm operators, achieving a 15% performance boost in CNN training and inference for computer vision tasks
- 30+ contributions to PyTorch.
- Authored a blog series of 15+ articles, providing the developer community with insights into PyTorch’s core architecture and optimizations.
- Details at My Contributions
Maintainer
vLLM
- Jun 2025 - Present
- Code owner for quantization, batch-invariant execution, caching, weight loading, and CUDA kernels
- Led the design and implementation of batch-invariant execution, showcased on the vLLM blog and mentioned at PyTorch Conference 2025.
- Optimized MoE shared-expert overlap scheduling, improving end-to-end throughput by ~6% and reducing time-to-first-token latency by 25%+.
- Integrated and tuned DeepGEMM on B200/H100 GPUs, delivering ~11% throughput gains on B200 and ~6% on H100 while preserving accuracy; shipped DeepSeek V3.2 support in one week.
- Developed and optimized low-precision quantization kernels (INT8/FP8) for LLM inference, speeding up INT8 models by ~13% on H100 and FP8 models by ~7% on B200 without accuracy loss.
- Details at My Contributions and Bi-weekly Journal
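For readers unfamiliar with low-precision inference, here is a toy sketch of symmetric INT8 quantization in plain Python. This is illustrative only, not vLLM's actual kernels (which run fused on GPU): a tensor is scaled so its largest magnitude maps to 127, stored as 8-bit integers, and scaled back at compute time.

```python
# Toy symmetric INT8 quantization: scale = max|x| / 127,
# q = clamp(round(x / scale)), dequantization multiplies back.
def quantize_int8(xs):
    scale = max(abs(v) for v in xs) / 127.0
    qs = [max(-127, min(127, round(v / scale))) for v in xs]
    return qs, scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

xs = [0.1, -2.0, 1.5]
qs, scale = quantize_int8(xs)
# Round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-9
           for a, b in zip(dequantize(qs, scale), xs))
```

Real kernels additionally fuse the scaling into matrix multiplies and use per-channel or per-block scales to keep accuracy loss negligible.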
LazyLLM
- May 2024 - Aug 2024
- Built a Retrieval-Augmented Generation (RAG) system with a specialized tree architecture, improving query performance by 50% over LlamaIndex through more efficient parent/child node retrieval.
- Details at My Contributions
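The parent/child retrieval idea can be sketched as follows. This is a hypothetical illustration (the names and the keyword match are stand-ins, not LazyLLM's API): chunks keep a parent pointer, so a matched leaf can be expanded to its enclosing section in O(1) without re-searching the index.

```python
# Tree-structured retrieval sketch: leaves link back to their parent
# section, so context expansion is a pointer walk, not a second search.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

    def add(self, text):
        child = Node(text, parent=self)
        self.children.append(child)
        return child

def retrieve_with_context(leaves, query):
    # Naive substring match stands in for embedding similarity.
    hits = [n for n in leaves if query in n.text]
    return [(n.text, n.parent.text if n.parent else None) for n in hits]

doc = Node("Chapter 1: Setup")
leaves = [doc.add("install deps"), doc.add("configure paths")]
print(retrieve_with_context(leaves, "configure"))
# [('configure paths', 'Chapter 1: Setup')]
```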
MMCV & PAVI Logger
- Jan 2021 - Dec 2022
- Rebuilt the PAVI data-collection SDK, achieving a 10× improvement in data-upload efficiency through optimized parallel processing and significantly reducing ingestion time for large-scale datasets.
- Integrated the proprietary PAVI Logger system into the MMCV library, enabling efficient and customizable logging for deep learning workflows, with the core system remaining private.
DeepLink & DIOPI
- Apr 2023 - May 2024
- Optimized Llama 2-70B training on 1,024 NPUs by integrating distributed training strategies (ZeRO, tensor parallelism, pipeline parallelism) and operator-level optimizations, achieving a 700% increase in TGS (tokens per GPU per second).
- Details at Deeplink and DIOPI
Owner
GAN-Paint
- Nov 2024 - Jan 2025
- Developed a lightweight GAN (generative adversarial network) for large-area image completion and cross-scene stitching, achieving realistic outputs on a single RTX 2070 GPU.
- Implemented an end-to-end training pipeline with efficient data preprocessing, masking strategies, and evaluation, completing model training within hours.
MicroTorch
- Jun 2023 - Aug 2024
- Developed a minimalistic deep learning framework inspired by PyTorch, implementing core functionalities such as AutoGrad, dynamic computation graphs, and tensor operations.
- Designed to be lightweight and modular, making it ideal for educational purposes, with extensive examples to facilitate learning.
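The core AutoGrad idea behind a framework like this can be shown in a few lines. This is a minimal scalar sketch of reverse-mode autodiff, not MicroTorch's actual code: each value records its parents and a closure that propagates gradients, and `backward()` replays those closures in reverse topological order.

```python
# Minimal scalar autograd: the forward pass builds a graph of Values,
# and backward() walks it in reverse to accumulate gradients.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topological order ensures a node's grad is complete
        # before it is propagated to its parents.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

PyTorch generalizes the same structure to tensors, with each operator contributing its own backward closure.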
CMU CSAPP
- Dec 2022 - Feb 2024
- Self-studied the CMU CSAPP-15213 course and completed its associated labs, covering core concepts such as assembly optimization, multi-level cache, compiling and linking, exception control flow, virtual memory, and system-level I/O.
- Blogs
TinyNN
- Nov 2022 - Dec 2022
- Built TinyNN, a minimal implementation of Fully Connected Neural Networks and Convolutional Neural Networks, designed for educational and experimental purposes.
You After Taking Drugs
- Aug 2021
- Independently developed this system in 7 days using computer-vision algorithms; optimized it to run smoothly on a single i3 CPU, ensuring a seamless user experience and earning client approval in the first review.
- Software Copyright: “After Taking Drugs (Facial Human Morphing Experience)” (2022SR0021854).
Sicpy Compiler
- Nov 2020 - Dec 2020
- Designed and implemented an untyped programming language, Sicpy, and its compiler using flex and bison.
- Implemented lexical, syntax, and semantic analysis, along with type inference and automatic garbage collection via reference counting, providing a complete custom-language framework for experimenting with functional and imperative programming.
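Reference counting, the GC strategy used here, is simple to illustrate. Below is a toy Python sketch (hypothetical names, not the Sicpy runtime): every object tracks how many references point at it and is reclaimed the moment the count reaches zero.

```python
# Toy reference-counting sketch: retain/release adjust the count,
# and the object is reclaimed when no references remain.
class RefCounted:
    def __init__(self, name):
        self.name = name
        self.count = 0
        self.freed = False

    def retain(self):
        self.count += 1

    def release(self):
        self.count -= 1
        if self.count == 0:
            self.freed = True  # stand-in for actually freeing memory

obj = RefCounted("cell")
obj.retain()      # bound to a variable
obj.retain()      # aliased by a second variable
obj.release()     # first binding goes out of scope
obj.release()     # last binding goes out of scope -> reclaimed
print(obj.freed)  # True
```

The classic limitation is that reference cycles never reach zero, which is why languages like Python pair refcounting with a cycle collector.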
New Super Mario
- Apr 2020
- Group project with Jifeng Wu, Jinran Tang, and Taihe Li.

