This document presents a structured approach to building a web proxy that supports multi-threading for concurrent request handling and implements an in-memory cache with a Least Recently Used (LRU) eviction policy.
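As a minimal sketch of the caching idea only (not the proxy implementation itself, and written in Python purely for illustration), an LRU cache can be built on an ordered dictionary that evicts the least recently used entry once capacity is exceeded:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny illustrative LRU cache: most recently used keys live at the end."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used entry
```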
This article offers an insightful look into dtype promotion in PyTorch, explaining how different data types are handled during tensor operations. It covers the fundamental rules of dtype promotion, the specifics of how scalar values are integrated into tensor operations, and the role of TensorIterator in computing dtypes.
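For a quick taste of the promotion rules (a sketch, assuming a recent PyTorch build), integer tensors are promoted when combined with floating-point tensors, and Python scalars participate with a "weaker" dtype than tensors:

```python
import torch

a = torch.ones(3, dtype=torch.int32)
b = torch.ones(3, dtype=torch.float64)

print((a + b).dtype)            # torch.float64: int32 is promoted to the float dtype
print((a + 1.5).dtype)          # torch.float32: a Python float scalar promotes to the default float dtype
print(torch.result_type(a, b))  # torch.float64: query the promotion rule without running the op
```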
This article provides an in-depth examination of the Structured Kernel and TensorIterator in PyTorch, key components for optimizing tensor operations. We will delve into the implementation aspects, including op declaration, the meta and impl steps in a Structured Kernel, and the construction and computation processes in TensorIterator.
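As a rough Python-level way to observe the meta step in isolation (a sketch, not the C++ code the article walks through), tensors on the "meta" device carry only shape and dtype, so running an op on them exercises the shape/dtype inference while skipping the impl computation:

```python
import torch

# "meta" tensors hold metadata only: no real storage, no kernel launch.
a = torch.empty(3, 1, dtype=torch.int32, device="meta")
b = torch.empty(1, 4, dtype=torch.float32, device="meta")

out = torch.add(a, b)                    # only shape/dtype inference runs
print(out.shape, out.dtype, out.device)  # torch.Size([3, 4]) torch.float32 meta
```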
PyTorch Compiler Introduction
This article explores the basic concepts of CUDA streams, parallel execution, and multi-GPU synchronization strategies. We analyze the advantages of using multiple CUDA streams and how to ensure task synchronization through CUDA events, using streams to optimize program performance.
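As a small sketch of the pattern (assuming a CUDA-capable device and using PyTorch's stream/event wrappers rather than the raw CUDA API), two streams can run independent work concurrently while an event enforces ordering between them:

```python
import torch

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(s1):
    a = torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")
    done.record(s1)              # mark completion of the work queued on s1

with torch.cuda.stream(s2):
    s2.wait_event(done)          # s2 will not start until the event on s1 fires
    b = a * 2                    # safe: ordered after the matmul on s1

torch.cuda.synchronize()         # wait for all streams before reading results on the host
```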
This document provides a comprehensive overview of distributed training capabilities within PyTorch. Covering the core components of torch.distributed, it delves into Distributed Data-Parallel Training (DDP), RPC-Based Distributed Training, and Collective Communication (c10d).
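A minimal DDP sketch (assuming a torchrun launch that sets RANK/WORLD_SIZE/LOCAL_RANK, and NCCL-capable GPUs) looks roughly like this:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # join the default process group (c10d)
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks

    out = ddp_model(torch.randn(4, 10, device=f"cuda:{local_rank}"))
    out.sum().backward()                             # backward triggers the collective communication
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would typically be launched with something like torchrun --nproc_per_node=<num_gpus> script.py, which spawns one process per GPU.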
In malloc lab, we will implement our own versions of malloc, free, and realloc.
This article introduces the implementation details of PyTorch's broadcast mechanism, including both the forward and backward computations.
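As a small example of the behavior in question (a Python-level sketch, not the internal implementation), broadcasting expands shapes in the forward pass, and the backward pass sums gradients back over the broadcast dimensions:

```python
import torch

x = torch.randn(3, 1, requires_grad=True)   # shape (3, 1)
y = torch.randn(1, 4, requires_grad=True)   # shape (1, 4)

z = x + y            # forward: broadcast to shape (3, 4)
z.sum().backward()   # backward: gradients are summed over the broadcast dimensions

print(x.grad.shape)  # torch.Size([3, 1])
print(y.grad.shape)  # torch.Size([1, 4])
```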
This article dissects PyTorch’s C++ core to uncover the mechanics of tensor indexing and assignment. From translating Python indices to C++ TensorIndex to the nuances of handleDimInMultiDimIndexing, we explore both basic and advanced tensor operations.
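For context, these are the Python-level operations whose C++ translation the article examines (a sketch; the specific values are only for illustration):

```python
import torch

t = torch.arange(12).reshape(3, 4)

print(t[1, 2])     # basic integer indexing: selects a single element
print(t[:, 1:3])   # slicing: returns a view that shares storage with t
print(t[t > 5])    # advanced (boolean) indexing: returns a copy
t[0] = -1          # index assignment writes through to t's storage
```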
This article introduces the implementation details of PyTorch's autograd mechanism.
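A tiny illustration of the mechanism (a sketch, not the article's internals walkthrough): the forward pass records a graph of operations, and backward() traverses it in reverse to accumulate gradients:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x   # forward pass builds the autograd graph
y.backward()         # backward pass traverses the graph in reverse

print(x.grad)        # dy/dx = 3*x**2 + 2 = 14.0
```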