Wentao's Blog

Summary: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

yewentao published on 2025-05-17 included in category Paper_summary

Summary for paper ‘DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving’

Summary: Fast Inference from Transformers via Speculative Decoding

yewentao published on 2025-05-11 included in category Paper_summary

Summary for paper ‘Fast Inference from Transformers via Speculative Decoding’

Summary: MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

yewentao published on 2025-04-29 included in category Paper_summary

Summary for paper ‘MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation’

Summary: Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing

yewentao published on 2025-04-27 included in category Paper_summary

Summary for paper ‘Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing’

Summary: Efficient Memory Management for Large Language Model Serving with PagedAttention

yewentao published on 2025-04-17 included in categories Paper_summary Vllm

Summary for paper ‘Efficient Memory Management for Large Language Model Serving with PagedAttention’

Summary: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

yewentao published on 2025-04-13 included in category Paper_summary

Summary for paper ‘Incentivizing Reasoning Capability in LLMs via Reinforcement Learning’

TVM: 2D Depth Conv GPU Optimization

yewentao published on 2025-04-07 included in category Tvm

This blog demonstrates optimization techniques for 2D depth Convolution in GPU using TVM, including block and thread organization, memory hierarchy exploitation and dimension fuse, etc.

TVM: GEMM GPU Optimization

yewentao published on 2025-04-06 included in category Tvm

This blog demonstrates optimization techniques for GEMM in GPU using TVM, including thread organization and memory hierarchy exploitation.

TVM: 1D convolution GPU Optimization

yewentao published on 2025-04-03 included in category Tvm

This blog demonstrates optimization techniques for 1D GPU convolution using TVM, including thread organization, memory hierarchy exploitation, and low-level optimizations.

Summary: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

yewentao published on 2025-04-02 included in category Paper_summary

Summary for paper ‘ZeRO: Memory Optimizations Toward Training Trillion Parameter Models’