About Me

Hello Friend, welcome to my blog!


Awards & Patents

  • One-iter Tool (CN117312173A), patented in 2023, reducing model accuracy validation time from hours to minutes.

  • 100k Cornell Startup Award
    • PolyRook: Fast 3D environment generation.
  • National Third Prize
    • China University Computer Contest, WeChat Big Data Challenge, 2021
    • Rank 80 / 6,768 teams
    • Certificate
  • National Third Prize
    • China University Computer Contest, Huawei Big Data Challenge, 2020
    • Rank 27 / 1,491 teams
    • Certificate
  • National Second Prize
    • China Service Outsourcing Innovation & Entrepreneurship Competition, 2020
    • Top 1% / 6,417 teams
    • Certificate
  • National First Prize
    • China University Computer Capability Challenge, 2019
    • Runner-up / 414 teams
    • Certificate

Experience

  • Machine Learning Engineer

    • Red Hat
    • Jun 2025 - Present
  • Deep Learning Engineer

    • SenseTime | SenseCore
    • Jul 2022 - Aug 2024
  • R&D Intern

    • SenseTime | Research Institute (Deep Learning Frameworks)
    • Jan 2021 - Jul 2022
  • Co-founder & CTO

    • Wuhan Hongyuan Investment & Technology Services Co., Ltd.
    • Nov 2019 - Sep 2020
  • Co-founder

    • Yuye (Wuhan) Technology Development Co., Ltd.
    • Jun 2019 - Nov 2019

Education

  • Master of Computer Science

    • Cornell University | New York, USA
    • May 2025
    • GPA: 4.21/4.0 (A+ counted as 4.3)
  • Bachelor of Software Engineering

    • Wuhan University | Wuhan, China
    • Jun 2022
    • GPA: 3.91/4.0
    • National Scholarship (top 1%)

Projects

  • vLLM (open-source contributor)
  • Jun 2025 - Present
  • Code owner for quantization, batch-invariant execution, caching, weight loading and CUDA kernels
  • Led the design and implementation of batch-invariant execution, showcased on the vLLM blog and mentioned at PyTorch Conference 2025.
  • Optimized MoE shared-expert overlap scheduling, improving end-to-end throughput by ~6% and reducing time-to-first-token latency by 25%+.
  • Integrated and tuned DeepGEMM on B200/H100 GPUs, delivering ~11% throughput gains on B200 and ~6% on H100 while preserving accuracy; shipped DeepSeek V3.2 support in one week.
  • Developed and optimized low-precision quantization kernels (INT8/FP8) for LLM inference, accelerating INT8 models by ~13% on H100 and FP8 models by ~7% on B200 with no accuracy loss (the core idea is sketched after this list).
  • Details at My Contributions and Bi-weekly Journal
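
A minimal NumPy sketch of symmetric per-tensor INT8 (W8A8) quantization, the core idea behind such kernels. Everything below is illustrative: real inference kernels fuse these steps into a single CUDA GEMM, and none of this is vLLM code.

```python
# Symmetric per-tensor INT8 quantization: scale so the largest
# magnitude maps to 127, multiply in int32, dequantize once.
import numpy as np

def quantize_int8(t: np.ndarray) -> tuple[np.ndarray, float]:
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)  # int32 accumulator
    return acc * (sx * sw)                           # back to float

x = np.random.randn(8, 64).astype(np.float32)
w = np.random.randn(64, 64).astype(np.float32)
print(np.abs(int8_matmul(x, w) - x @ w).max())  # small quantization error
```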
  • May 2024 - Aug 2024
  • Built a Retrieval-Augmented Generation (RAG) system with a specialized tree architecture, improving query performance by 50% over LlamaIndex by making parent/child node retrieval more efficient (see the indexing sketch after this list).
  • Details at My Contributions
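
A minimal sketch of the tree-indexing idea: keep explicit parent/child maps so that expanding a retrieved node to its surrounding context is a dictionary lookup rather than a store-wide scan. The structure and names are illustrative, not the project's actual code.

```python
# Tree of document nodes with O(1) parent/child retrieval.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    text: str
    parent_id: str | None = None
    child_ids: list[str] = field(default_factory=list)

class TreeIndex:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        if node.parent_id is not None:
            self.nodes[node.parent_id].child_ids.append(node.node_id)

    def parent(self, node_id: str) -> Node | None:
        pid = self.nodes[node_id].parent_id
        return self.nodes[pid] if pid else None

    def children(self, node_id: str) -> list[Node]:
        return [self.nodes[c] for c in self.nodes[node_id].child_ids]

idx = TreeIndex()
idx.add(Node("doc", "full document"))
idx.add(Node("sec1", "section 1", parent_id="doc"))
idx.add(Node("p1", "paragraph 1", parent_id="sec1"))
print(idx.parent("p1").text)                   # section 1
print([c.text for c in idx.children("sec1")])  # ['paragraph 1']
```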
  • Jan 2021 - Dec 2022
  • Rebuilt the PAVI data collection SDK, achieving a 10× improvement in upload throughput through parallelized processing and significantly reducing ingestion time for large-scale datasets (see the sketch after this list).
  • Integrated the proprietary PAVI Logger system into the MMCV library, enabling efficient and customizable logging for deep learning workflows, with the core system remaining private.
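
A minimal sketch of the parallel-upload idea: batch records into chunks and push the chunks concurrently instead of serially. `upload_chunk` is a hypothetical placeholder, since the real SDK is proprietary.

```python
# Chunk records and upload them concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def upload_chunk(chunk: list[dict]) -> int:
    # Placeholder for the network call to the logging backend.
    return len(chunk)

def parallel_upload(records: list[dict], chunk_size: int = 256,
                    workers: int = 8) -> int:
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(upload_chunk, chunks))

records = [{"step": i, "loss": 1.0 / (i + 1)} for i in range(10_000)]
print(parallel_upload(records))  # 10000 records uploaded
```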
  • Apr 2023 - May 2024
  • Optimized Llama 2-70B training on 1024 NPUs by integrating distributed training strategies (ZeRO, tensor parallelism, pipeline parallelism) and operator-level optimizations, achieving a 700% increase in TGS (Tokens/GPU/Second); the metric is illustrated after this list.
  • Details at Deeplink and DIOPI
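
TGS itself is simple arithmetic: global token throughput divided by device count. The numbers below are invented to illustrate the formula and what a 700% increase (8×) means; they are not measurements from this project.

```python
# TGS = tokens per step / step time / number of devices.
def tgs(tokens_per_step: int, step_time_s: float, n_devices: int) -> float:
    return tokens_per_step / step_time_s / n_devices

before = tgs(tokens_per_step=4_000_000, step_time_s=60.0, n_devices=1024)
after = tgs(tokens_per_step=4_000_000, step_time_s=7.5, n_devices=1024)
print(f"{before:.1f} -> {after:.1f} TGS ({after / before:.0f}x, +700%)")
```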
  • Nov 2024 - Jan 2025
  • Developed a lightweight GAN (generative adversarial network) for large-area image completion and cross-scene stitching, achieving realistic outputs on a single RTX 2070 GPU.
  • Implemented an end-to-end training pipeline with efficient data preprocessing, masking strategies, and evaluation, completing model training within hours (one plausible masking strategy is sketched after this list).
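
A minimal sketch of one plausible masking strategy, assuming rectangular holes: the generator sees the masked image plus the mask and learns to fill the hole. All sizes and parameters are illustrative, not the project's actual configuration.

```python
# Random rectangular hole masks for image-completion training.
import numpy as np

def random_mask(h: int, w: int, min_frac: float = 0.2,
                max_frac: float = 0.5) -> np.ndarray:
    """Return an (h, w) float mask: 1 inside the hole, 0 elsewhere."""
    mh = np.random.randint(int(h * min_frac), int(h * max_frac) + 1)
    mw = np.random.randint(int(w * min_frac), int(w * max_frac) + 1)
    top = np.random.randint(0, h - mh + 1)
    left = np.random.randint(0, w - mw + 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + mh, left:left + mw] = 1.0
    return mask

img = np.random.rand(256, 256, 3).astype(np.float32)
m = random_mask(256, 256)
masked = img * (1.0 - m[..., None])  # zero out the hole region
print(masked.shape, f"hole covers {m.mean():.0%} of pixels")
```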
  • Jun 2023 - Aug 2024
  • Developed a minimalistic deep learning framework inspired by PyTorch, implementing core functionality such as AutoGrad, dynamic computation graphs, and tensor operations (the AutoGrad mechanism is sketched after this list).
  • Designed to be lightweight and modular, making it ideal for educational purposes, with extensive examples to facilitate learning.
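
A micrograd-style sketch of the reverse-mode AutoGrad mechanism such a framework is built around: each operation records its parents and a local backward rule, and backward() walks the dynamic graph in reverse topological order. Illustrative only, not the project's code.

```python
# Scalar autograd: dynamic graph + reverse-mode differentiation.
class Value:
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():          # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():          # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        order, seen = [], set()
        def visit(v):            # build reverse topological order
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
y = a * b + a                    # y = ab + a
y.backward()
print(a.grad, b.grad)            # 4.0 (= b + 1), 2.0 (= a)
```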
  • Dec 2022 - Feb 2024
  • Self-studied the CMU 15-213 (CS:APP) course and completed its labs, covering core concepts such as assembly-level optimization, multi-level caches, compiling and linking, exceptional control flow, virtual memory, and system-level I/O.
  • Blogs
  • Nov 2022 - Dec 2022
  • Built TinyNN, a minimal implementation of fully connected and convolutional neural networks, designed for educational and experimental use (a naive convolution kernel is sketched after this list).
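
A minimal sketch of the naive 2-D convolution at the heart of a from-scratch CNN layer; strides, padding, channels, and the backward pass are omitted, and this is illustrative rather than TinyNN's actual code.

```python
# Naive valid cross-correlation of a (H, W) input with a (kh, kw) kernel.
import numpy as np

def conv2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(25, dtype=np.float32).reshape(5, 5)
k = np.array([[1, 0], [0, -1]], dtype=np.float32)
print(conv2d(x, k).shape)  # (4, 4)
```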
  • Aug 2021
  • Independently developed this system in 7 days using computer vision algorithms, optimized to run smoothly on a single i3 CPU; the seamless user experience earned client approval at the first review.
  • Software Copyright: “After Taking Drugs (Facial Human Morphing Experience)” (2022SR0021854).
  • Nov 2020 - Dec 2020
  • Designed and implemented an untyped programming language, Sicpy, and its compiler using flex and bison.
  • Developed lexical, syntax, and semantic analysis, plus type inference and automatic garbage collection via reference counting (sketched after this list), providing a complete custom-language framework for experimenting with functional and imperative programming.
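
A minimal sketch of reference counting as a language runtime might implement it: every object carries a count, and dropping the last reference frees the object and releases everything it points to. Names are illustrative, not Sicpy's actual runtime, and reference cycles (the classic weakness of this scheme) are ignored.

```python
# Reference-counting garbage collection in miniature.
class Obj:
    def __init__(self, name: str):
        self.name, self.refcount, self.fields = name, 0, []

def retain(o: Obj) -> Obj:
    o.refcount += 1
    return o

def release(o: Obj) -> None:
    o.refcount -= 1
    if o.refcount == 0:          # last reference dropped: free it
        print(f"free {o.name}")
        for child in o.fields:   # freeing a container releases
            release(child)       # everything it references

parent, child = Obj("parent"), Obj("child")
parent.fields.append(retain(child))  # parent holds a reference to child
retain(parent)                       # a variable references parent
release(parent)                      # frees parent, then child
```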
  • Apr 2020
  • Group project with Jifeng Wu, Jinran Tang and Taihe Li.
