Wentao's Blog

2025 Technical Notes (3)

yewentao published on 2025-05-23 included in category Technical_notes

Technical notes from 2025 (part 3).

Summary: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

yewentao published on 2025-05-22 included in category Paper_summary

Summary for paper ‘FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness’

Summary: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

yewentao published on 2025-05-17 included in category Paper_summary

Summary for paper ‘DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving’

Summary: Fast Inference from Transformers via Speculative Decoding

yewentao published on 2025-05-11 included in category Paper_summary

Summary for paper ‘Fast Inference from Transformers via Speculative Decoding’

Summary: MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

yewentao published on 2025-04-29 included in category Paper_summary

Summary for paper ‘MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation’

Summary: Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing

yewentao published on 2025-04-27 included in category Paper_summary

Summary for paper ‘Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing’

Summary: Efficient Memory Management for Large Language Model Serving with PagedAttention

yewentao published on 2025-04-17 included in categories Paper_summary Vllm

Summary for paper ‘Efficient Memory Management for Large Language Model Serving with PagedAttention’

Summary: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

yewentao published on 2025-04-13 included in category Paper_summary

Summary for paper ‘DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning’

TVM: 2D Depth Conv GPU Optimization

yewentao published on 2025-04-07 included in category Tvm

This post demonstrates optimization techniques for 2D depthwise convolution on the GPU using TVM, including block and thread organization, memory hierarchy exploitation, and dimension fusion.

TVM: GEMM GPU Optimization

yewentao published on 2025-04-06 included in category Tvm

This post demonstrates optimization techniques for GEMM on the GPU using TVM, including thread organization and memory hierarchy exploitation.