Technical notes during 2025 (3).
Summary for paper ‘FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness’
Summary for paper ‘DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving’
Summary for paper ‘Fast Inference from Transformers via Speculative Decoding’
Summary for paper ‘MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation’
Summary for paper ‘Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing’
Summary for paper ‘Efficient Memory Management for Large Language Model Serving with PagedAttention’
Summary for paper ‘Incentivizing Reasoning Capability in LLMs via Reinforcement Learning’
yewentao
Published on 2025-04-07, included in category Tvm. This blog demonstrates optimization techniques for 2D depthwise convolution on GPU using TVM, including block and thread organization, memory hierarchy exploitation, and dimension fusion.
yewentao
Published on 2025-04-06, included in category Tvm. This blog demonstrates optimization techniques for GEMM on GPU using TVM, including thread organization and memory hierarchy exploitation.