PyTorch Compiler Introduction
This article explores the basic concepts of CUDA streams, parallel execution, and multi-GPU synchronization strategies. We analyze the advantages of using multiple CUDA streams, how to ensure task synchronization through CUDA events, and how streams can be used to optimize program performance.
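As a rough illustration of the idea (not taken from the article), here is a minimal PyTorch sketch; the names `s1`, `s2`, and `done` are arbitrary, and it assumes a CUDA-capable GPU:

```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(s1):
    x = a @ a          # matmul queued on stream s1
    done.record(s1)    # mark completion of the work on s1

with torch.cuda.stream(s2):
    s2.wait_event(done)  # s2 waits until s1's work has finished
    y = x @ b            # now safe to consume x on s2

torch.cuda.synchronize()  # block the host until all streams finish
```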
This document provides a comprehensive overview of distributed training capabilities within PyTorch. Covering the core components of torch.distributed, it delves into Distributed Data-Parallel Training (DDP), RPC-Based Distributed Training, and Collective Communication (c10d).
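For orientation, a minimal DDP sketch might look like the following; the backend choice and the assumption of a `torchrun` launch are mine, not details from the document:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes the process was launched with torchrun, which sets
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(32, 10, device=rank)
    loss = ddp_model(inputs).sum()
    loss.backward()          # c10d all-reduce runs during the backward pass
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```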
In malloc lab, we will implement our own versions of malloc, free, and realloc.
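For context (not part of the lab itself), the allocator contract these functions follow can be exercised from Python via ctypes; this sketch assumes a Unix-like system where the C library can be located:

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.realloc.restype = ctypes.c_void_p
libc.realloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

buf = libc.malloc(16)        # request a 16-byte block
buf = libc.realloc(buf, 64)  # grow the block, possibly moving it
libc.free(buf)               # return the block to the allocator
```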
This article introduces the implementation details of PyTorch's broadcast mechanism, including the forward and backward computation.
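As a quick, informal illustration of the broadcasting semantics involved (not taken from the article): shapes are aligned from the trailing dimension, size-1 dimensions expand, and gradients are summed back down to each input's original shape.

```python
import torch

a = torch.randn(4, 1, 3, requires_grad=True)  # shape (4, 1, 3)
b = torch.randn(2, 3, requires_grad=True)     # shape    (2, 3)

c = a + b
print(c.shape)  # torch.Size([4, 2, 3]): broadcast result

c.sum().backward()
# Gradients are reduced over the broadcast dimensions in the backward pass.
print(a.grad.shape, b.grad.shape)  # torch.Size([4, 1, 3]) torch.Size([2, 3])
```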
This article dissects PyTorch’s C++ core to uncover the mechanics of tensor indexing and assignment. From translating Python indices to C++ TensorIndex to the nuances of handleDimInMultiDimIndexing, we explore both basic and advanced tensor operations.
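For a rough sense of what basic versus advanced indexing means at the Python level (the C++ details are the article's subject; this snippet is only an illustration):

```python
import torch

t = torch.arange(12).reshape(3, 4)

# Basic indexing: integers and slices produce views of the original storage.
row = t[1]           # shape (4,)
block = t[:2, 1:3]   # shape (2, 2), still a view

# Advanced indexing: index tensors and boolean masks produce copies.
picked = t[torch.tensor([0, 2]), torch.tensor([1, 3])]  # elements (0,1) and (2,3)
masked = t[t > 5]                                        # 1-D tensor of matching elements

# Indexed assignment writes through to the original tensor.
t[0, :] = -1
print(t[0])  # tensor([-1, -1, -1, -1])
```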
This article introduces the implementation details of PyTorch's autograd mechanism.
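A tiny sketch of what autograd does at the user level, just for orientation (the article covers the machinery underneath):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # forward pass records the computation graph

y.backward()        # reverse-mode traversal of the graph
print(x.grad)       # tensor([4., 6.]), i.e. dy/dx = 2x
```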
In shell lab, we’ll become more familiar with the concepts of process control and signals by writing a simple Unix shell program that supports job control. Source: [https://github.com/yewentao256/CSAPP_15213/tree/main/shelllab]
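As a loose illustration of the process-control and signal primitives involved (a Python sketch rather than the lab's C skeleton; it assumes a Unix-like system):

```python
import os
import signal

# Keep Ctrl-C from killing the shell itself; the foreground child still receives it.
signal.signal(signal.SIGINT, lambda signum, frame: print("\ncaught SIGINT"))

argv = input("tsh> ").split()
if argv:
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)  # child replaces itself with the command
        finally:
            os._exit(1)               # exec failed: never fall back into the shell
    else:
        os.waitpid(pid, 0)            # parent waits for the foreground job
```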
Uncover the inner workings of PyTorch through a deep dive into the contiguous operator, from its Python interface to its dispatching and registration process, and finally how it is executed.
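As a small user-level illustration of the operator the article traces through the dispatcher (this snippet shows only the Python-visible behavior):

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()                      # transpose returns a view with swapped strides

print(t.is_contiguous())       # False: layout no longer matches row-major order
c = t.contiguous()             # dispatches to a kernel that copies into a compact layout

print(c.is_contiguous())       # True
print(t.stride(), c.stride())  # (1, 3) vs (2, 1)
```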