yewentao
Published on 2025-04-07, in category TVM. This blog post demonstrates optimization techniques for 2D depthwise convolution on GPUs using TVM, including block and thread organization, memory-hierarchy exploitation, and dimension fusion.
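As a point of reference for the computation the post optimizes, here is a minimal plain-NumPy sketch of a 2D depthwise convolution (valid padding, stride 1). This is an assumption about the exact kernel the post targets, not the post's TVM schedule itself; in TVM the same loop nest would be expressed with `te.compute` and the fused channel/spatial axes bound to `blockIdx`/`threadIdx` as the excerpt describes.

```python
import numpy as np

def depthwise_conv2d(x, w):
    """Reference 2D depthwise convolution (valid padding, stride 1).

    x: (C, H, W) input; w: (C, K, K), one filter per channel.
    Each channel is convolved only with its own filter -- this is the
    loop nest a TVM GPU schedule would tile, fuse, and bind to
    blocks/threads (hypothetical sketch, not the post's code).
    """
    C, H, W = x.shape
    _, K, _ = w.shape
    out = np.zeros((C, H - K + 1, W - K + 1), dtype=x.dtype)
    for c in range(C):                      # independent per channel
        for i in range(H - K + 1):          # output row
            for j in range(W - K + 1):      # output column
                out[c, i, j] = np.sum(x[c, i:i + K, j:j + K] * w[c])
    return out
```

In a TVM schedule, the `c`, `i`, and `j` loops above are what get reorganized: the channel and block-level spatial loops are fused and mapped to GPU thread blocks, while inner tiles are mapped to threads and staged through shared memory.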