
Wentao's Blog

Exploring Structured Kernel and Tensor Iterator in PyTorch

This article provides an in-depth examination of the Structured Kernel and TensorIterator in PyTorch, key components for optimizing tensor operations. We will delve into the implementation aspects, including op declaration, meta and impl steps in Structured Kernel, and the construction and computation processes in TensorIterator.

PyTorch CUDA Streams Introduction

This article explores the basic concepts of CUDA streams, parallel execution, and multi-GPU synchronization strategies. We analyze the advantages of using multiple CUDA streams, how to ensure task synchronization through CUDA events, and how to use CUDA streams to optimize program performance.

Overview of PyTorch Distributed Training

This document provides a comprehensive overview of distributed training capabilities within PyTorch. It covers the core components of torch.distributed: Distributed Data-Parallel Training (DDP), RPC-Based Distributed Training, and Collective Communication (c10d).

ShellLab

In the shell lab, we’ll become more familiar with the concepts of process control and signals by writing a simple Unix shell program that supports job control. Source: [https://github.com/yewentao256/CSAPP_15213/tree/main/shelllab]