2025 Technical Notes(1)
The content on this page was translated by O3-mini-high.
Robotics
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a branch of machine learning where agents learn to take actions in an environment to maximize long-term rewards. One popular algorithm is Proximal Policy Optimization (PPO), a deep RL method based on policy optimization.
Value-Based Methods
These algorithms evaluate the quality of each action and select the best one to improve the policy. For example, in a maze scenario, if moving left has a high value while moving right leads into a trap, an algorithm like Q-Learning—which records the value for every action at each state—will choose the optimal move. Although simple and effective for small-scale discrete problems, value-based methods can suffer from greedy choices that may miss the global optimum, and they struggle with continuous action spaces (e.g., precise joint angles in robots).
Note: This approach differs from imitation learning; it does not require expert demonstrations but learns solely through interactions with the environment.
Policy-Based Methods
In contrast, policy-based methods directly parameterize the policy. The agent executes actions to receive rewards, aiming to maximize the long-term return. In robotics, pure RL can be both costly and risky—imagine programming a robot with the idea “I tend to avoid obstacles” or “I prefer to open treasure chests.” Policy-based methods are well-suited for continuous actions and offer greater flexibility, though they often suffer from lower sample efficiency and high variance during training.
A popular policy-based algorithm is PPO. In practice, the Actor-Critic framework is often used, where the critic evaluates the actions and the actor generates them. Furthermore, pretraining is crucial in RL—starting from scratch can be prohibitively time-consuming and even dangerous (especially for robots). Thus, using offline or unsupervised pretraining can significantly accelerate the learning process.
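To make the PPO idea concrete, its core is a clipped surrogate objective that limits how far the updated policy can move away from the policy that collected the data. A minimal NumPy sketch of that objective (the probabilities and advantages below are made up purely for illustration):

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized)."""
    ratio = np.exp(new_logp - old_logp)              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))   # pessimistic (clipped) bound

# Toy example: log-probs of the taken actions under the old and new policy
old_logp = np.log(np.array([0.3, 0.5, 0.2]))
new_logp = np.log(np.array([0.4, 0.45, 0.15]))
advantages = np.array([1.0, -0.5, 0.2])
print(ppo_clip_loss(new_logp, old_logp, advantages))
```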
Imitation Learning (IL)
Imitation Learning (IL)—or learning from demonstrations—involves mimicking expert data. This approach is particularly useful in domains such as robotic arm manipulation, service robotics, and autonomous driving. However, pure imitation learning often lacks generalization capabilities.
In practice, a common strategy is to first pretrain using expert demonstrations (e.g., via Behavior Cloning, where a network is trained in a supervised fashion to predict actions from states) and then fine-tune using reinforcement learning or even combine both strategies concurrently.
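Behavior cloning itself reduces to ordinary supervised learning on (state, action) pairs from the expert. A minimal sketch, assuming a small discrete action space, a linear softmax policy, and synthetic "expert" data (everything here is illustrative, not a specific library API):

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))              # expert-visited states
actions = (states[:, 0] > 0).astype(int)        # synthetic expert actions (2 classes)

W = np.zeros((4, 2))                            # linear policy: logits = states @ W
lr = 0.1
for _ in range(200):
    logits = states @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Cross-entropy gradient: (probs - one_hot(actions)), averaged over the batch
    grad_logits = probs.copy()
    grad_logits[np.arange(len(actions)), actions] -= 1.0
    W -= lr * states.T @ grad_logits / len(actions)

pred = (states @ W).argmax(axis=1)
print("imitation accuracy:", (pred == actions).mean())
```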
Reinforcement Learning from Human Feedback (RLHF)
RLHF extends beyond robotics into broader deep learning applications. Its typical workflow is:
- Supervised Pretraining: Initialize the model using supervised learning.
- Collect Human Feedback: Gather feedback data where humans rate model outputs.
- Train a Reward Model: Use the human scores as targets for a reward model that takes the LLM’s output and produces a reward.
- RL Fine-Tuning: Utilize algorithms like PPO, with the reward model acting as the reward function, to fine-tune the LLM.
- Iterate: Continuously collect data and refine the model.
Linear Algebra
Matrix Eigenvalues: An eigenvalue is the scaling factor by which a matrix stretches or compresses a vector along the direction of its corresponding eigenvector. For instance, consider the diagonal matrix A = [[2, 0], [0, 3]]. Its eigenvectors are (1, 0) and (0, 1), with eigenvalues 2 (scaling along the x-axis) and 3 (scaling along the y-axis), respectively.
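The same decomposition can be checked numerically with NumPy:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [2. 3.]
print(eigenvectors)   # columns are the eigenvectors (1, 0) and (0, 1)
```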
Quadratic Equation:
For a quadratic equation ax^2 + bx + c = 0, the discriminant is: discriminant = b^2 - 4ac
and the solutions are: x = (-b ± √discriminant) / (2a)
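A direct translation of the formula into Python (assuming a non-negative discriminant):

```python
import math

def solve_quadratic(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0 (assumes discriminant >= 0)."""
    discriminant = b**2 - 4 * a * c
    root = math.sqrt(discriminant)
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

print(solve_quadratic(1, -5, 6))   # (3.0, 2.0): x^2 - 5x + 6 = (x - 2)(x - 3)
```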
Distributed Training
RDMA
Remote Direct Memory Access (RDMA) enables direct memory access from the memory of one computer into that of another without involving the CPU, which significantly reduces latency. Common RDMA technologies include:
- InfiniBand (IB)
- RoCE (RDMA over Converged Ethernet)
Within a single server, high-speed interconnects like NVLink are used between GPUs, while RDMA (e.g., GPUDirect RDMA) is preferred for multi-node communication.
Deep Learning
Fundamentals
Pruning
Pruning reduces redundant weights (or even entire layers) in a network:
- Weight Magnitude-Based: Small weights are pruned.
- Gradient Magnitude-Based: Weights with minimal gradients (i.e., minimal impact on the loss) are pruned.
- Activation-Based: Low activations indicate little contribution and may be pruned.
After pruning (where pruned values are set to zero, making matrices sparse), specialized optimizations can reduce storage and computation. Since pruning can degrade performance, fine-tuning is often required in iterative cycles.
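As an illustration of weight-magnitude pruning, the sketch below zeroes out the fraction of weights with the smallest magnitude; the sparsity level and matrix size are arbitrary:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned, mask = magnitude_prune(W, sparsity=0.75)
print("kept weights:", mask.sum(), "of", mask.size)
```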
Distillation
Knowledge distillation involves transferring knowledge from a larger “teacher” model to a smaller “student” model using soft targets.
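A common formulation combines a temperature-softened KL term between teacher and student logits with the usual hard-label cross-entropy. A minimal NumPy sketch of such a loss (the logits, labels, temperature, and mixing weight are illustrative assumptions):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL(teacher || student) at temperature T, scaled by T^2
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)) * T**2
    # Hard-label cross-entropy on the true labels
    p = softmax(student_logits)
    hard = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return alpha * soft + (1 - alpha) * hard

student = np.array([[1.0, 0.5, -0.2]])
teacher = np.array([[2.0, 0.1, -1.0]])
print(distillation_loss(student, teacher, labels=np.array([0])))
```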
ONNX Runtime
ONNX Runtime supports multiple backends (CPU, GPU, TensorRT) and unifies different model formats for streamlined inference.
Object Detection Paradigms
- RCNN: Uses selective search to propose candidate regions, then runs a CNN on each region and classifies it.
- Fast-RCNN: Runs the CNN once over the entire image to produce a shared feature map, then pools features for each candidate box (RoI pooling) before classification.
- Faster-RCNN: Replaces selective search with a Region Proposal Network that generates proposals directly from the feature map.
For 3D detection, sensor fusion (e.g., combining camera data with LiDAR) is used. Although methods like NeRF focus on 3D reconstruction rather than bounding box detection, they serve as a foundation for later stages.
Finetuning
Finetuning involves freezing certain layers of a pretrained model and training the remaining layers on new data to adapt to specific tasks.
Loss Functions
For example, with the cross-entropy loss L = -Σ_i y_i · log(ŷ_i), if y = [0, 1, 0] and y_softmax = [0.1, 0.8, 0.1], the loss is -log(0.8) ≈ 0.223. For binary classification, the loss becomes L = -[y·log(p) + (1 − y)·log(1 − p)], where p is the probability output after a sigmoid activation.
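The number above can be reproduced directly:

```python
import numpy as np

y = np.array([0, 1, 0])
y_softmax = np.array([0.1, 0.8, 0.1])
loss = -np.sum(y * np.log(y_softmax))
print(loss)   # ~0.223, i.e. -log(0.8)
```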
Low-Rank Decomposition
Low-rank decomposition approximates weight matrices by factorizing them (e.g., via Singular Value Decomposition), reducing both storage and computational costs.
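For instance, a weight matrix can be approximated by keeping only its top-k singular values, replacing one large matrix multiplication with two smaller ones; the shapes and rank below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 8                                 # target rank
A = U[:, :k] * S[:k]                  # (64, k)
B = Vt[:k, :]                         # (k, 32)
W_approx = A @ B                      # low-rank approximation of W

print("params:", W.size, "->", A.size + B.size)
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```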
Out-Operator Optimization
The “out operator” can improve speed by:
- Reducing runtime malloc calls.
- Minimizing extra copies of data.
While PyTorch’s memory management minimizes malloc overhead, pre-allocating memory for out operators—especially during communication—can lead to significant performance gains. Note that CUDA malloc calls remain expensive.
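NumPy exposes the same idea through the `out=` argument: the result is written into a pre-allocated buffer instead of a freshly allocated array (PyTorch ops such as `torch.add(..., out=...)` follow the same pattern). A small sketch:

```python
import numpy as np

a = np.random.rand(1024, 1024)
b = np.random.rand(1024, 1024)
result = np.empty((1024, 1024))      # allocate the output buffer once

for _ in range(10):                  # e.g., inside a hot loop
    np.matmul(a, b, out=result)      # reuses `result`; no new allocation per call
```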
Quantization (TensorRT)
Quantization reduces model precision to speed up inference and reduce memory usage. There are two main approaches:
Post-Training Quantization (PTQ)
- Export the Model: Convert the trained model to ONNX format. Ensure that all operators are supported (or implement custom plugins for unsupported ones, such as flash attention or RMSNorm in LLMs).
- Build the TensorRT Engine: Serialize the engine into a .plan file or load it directly for inference.
- Calibration: Because int8 values range from -128 to 127 while floating-point values can be much larger, calibration is needed. A scale factor is computed so that x_int8 ≈ round(x_float / scale) stays within the int8 range (for symmetric quantization, scale = max|x_float| / 127); a small numeric sketch follows the list of calibration methods below.
Calibration methods include:
- MinMax Calibration: Uses the minimum and maximum values, though it is sensitive to outliers.
- Entropy Calibration: Minimizes KL divergence between the original and quantized distributions.
- Percentile Strategy: Discards extreme values (e.g., the 99.9th percentile) to set calibration bounds.
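To make the scale computation concrete, a minimal min-max calibration for symmetric int8 quantization might look like this (the calibration data and values are synthetic stand-ins):

```python
import numpy as np

def minmax_scale(activations):
    """Symmetric int8 scale: map the largest observed magnitude to 127."""
    return np.max(np.abs(activations)) / 127.0

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

calib_batch = np.random.randn(1000) * 3.0        # stand-in for calibration data
scale = minmax_scale(calib_batch)
x = np.array([0.5, -2.7, 3.1])
q = quantize_int8(x, scale)
print(scale, q, q.astype(np.float32) * scale)    # dequantized approximation of x
```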
Quantization-Aware Training (QAT)
QAT integrates quantization into the training process by simulating quantization error during forward passes while using techniques like fake quantization and the Straight-Through Estimator (STE) for backpropagation. The typical workflow is:
- Fake Quantization: Simulate quantization in the forward pass while keeping the output in float format:
- Quantize: q = clamp(round(x / scale), -128, 127)
- Dequantize: x̂ = q · scale
- STE: During backpropagation, non-differentiable operations like rounding are assumed to have a derivative of 1.
For instance, an original activation x = [0.45, -1.23, 2.99, 0.02] might be fake-quantized (e.g., with scale = 0.1) to [5, -12, 30, 0] and then dequantized to approximately [0.5, -1.2, 3.0, 0.0], thereby simulating the error introduced by quantization.
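A compact sketch of fake quantization with a straight-through-estimator gradient, reproducing the example above (scale, clipping range, and values are illustrative):

```python
import numpy as np

def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize then immediately dequantize; the output stays in float format."""
    q = np.clip(np.floor(x / scale + 0.5), qmin, qmax)   # round-half-up, then clamp
    return q * scale

def fake_quantize_grad(upstream_grad):
    """Straight-through estimator: treat round() as identity in the backward pass."""
    return upstream_grad

x = np.array([0.45, -1.23, 2.99, 0.02])
x_fq = fake_quantize(x, scale=0.1)
print(x_fq)                                    # ~[0.5, -1.2, 3.0, 0.0]
print(fake_quantize_grad(np.ones_like(x)))     # gradients pass through unchanged
```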
Common Issues
Gradient Vanishing:
Common causes include:
- Excessive network depth, which can be mitigated by using proper network designs and residual connections.
- Activation function saturation (e.g., ReLU zeroing negatives or sigmoid saturating for extreme values).
Mitigation strategies:
- Use alternative activations such as Leaky ReLU.
- Apply normalization techniques (e.g., BatchNorm).
- Employ appropriate weight initialization methods:
- Xavier Initialization: Suitable for sigmoid/tanh, adjusts the initial range based on neuron count.
- He Initialization: Better for ReLU networks due to its consideration of the non-linear activation’s behavior.
Handling Unlabeled Data:
For scenarios like determining whether a video relates to restaurants:
- Use simple rule-based methods to extract weak labels (keywords like “restaurant”, “menu”, “food”).
- Extract features:
- Use a pretrained classification model on video frames to detect food, menus, or dining scenes.
- Use multimodal models (e.g., CLIP) to extract embeddings from video frames and evaluate restaurant-related scores.
- Alternatively, use unsupervised clustering to group data based on frame content and audio, then manually label clusters.
Once initial pseudo-labels are generated, a preliminary model can be trained and iteratively improved via pseudo-labeling.
LogSoftmax Forward and Backward
Here is an example implementation using NumPy:
```python
import numpy as np

class LogSoftmax:
    def __call__(self, x: np.ndarray) -> np.ndarray:
        assert len(x.shape) == 2, "x should have shape (batch_size, in_dim)"
        # log_softmax = x - log(sum(exp(x))) = x - (m + log(sum(exp(x - m))))
        # where m = max(x, axis=1, keepdims=True)
        max_val = np.max(x, axis=1, keepdims=True)
        self.exp = np.exp(x - max_val)
        self.sum_exp = np.sum(self.exp, axis=1, keepdims=True)
        log_value = np.log(self.sum_exp)
        return x - (max_val + log_value)

    def bprop(self, x: np.ndarray, dedy: np.ndarray) -> np.ndarray:
        softmax = self.exp / self.sum_exp
        sum_dedy = np.sum(dedy, axis=1, keepdims=True)
        dedx = dedy - softmax * sum_dedy
        return dedx
```
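A quick way to sanity-check the backward pass is to compare `bprop` against a finite-difference estimate of the gradient. A minimal check using the class above (the tolerance and shapes are arbitrary):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3, 5)
dedy = np.random.randn(3, 5)        # upstream gradient
layer = LogSoftmax()

layer(x)                            # forward pass caches exp / sum_exp
analytic = layer.bprop(x, dedy)

# Finite-difference estimate of d/dx of sum(dedy * log_softmax(x))
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp, xm = x.copy(), x.copy()
        xp[i, j] += eps
        xm[i, j] -= eps
        numeric[i, j] = (np.sum(dedy * layer(xp)) - np.sum(dedy * layer(xm))) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be tiny (numerical error only)
```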
Computer Vision (CV)
Diffusion Models
Denoising Diffusion Probabilistic Models (DDPM)
Introduced in 2020, DDPMs gradually add noise in a forward process and train a neural network to reverse this process using an MSE loss.
- Forward Process: q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t · I), i.e., noise is gradually added to the data.
- Reverse Process: p_θ(x_{t-1} | x_t) is approximated as N(x_{t-1}; μ_θ(x_t, t), σ_t² · I), with additional random noise injected at each step.
Note: The noise level is controlled by a schedule β_t (which increases linearly in DDPM). A downside is that reverse sampling can require hundreds of steps, making it slow.
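As a sketch, the standard closed-form forward noising step x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with a linear β schedule can be written as follows (the schedule endpoints and "image" are placeholders):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule beta_t
alphas_bar = np.cumprod(1.0 - betas)        # \bar{alpha}_t = prod(1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in closed form."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))                # stand-in for a (tiny) image
x_t, eps = q_sample(x0, t=500, rng=rng)     # the network is trained to predict eps from x_t
```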
Denoising Diffusion Implicit Models (DDIM)
Proposed in 2021, DDIM introduces deterministic reverse sampling:
- It removes the randomness inherent in DDPM.
- By skipping steps (e.g., reducing from 1000 to 50–100 steps), it greatly accelerates inference while maintaining image quality. Deterministic reverse sampling is achieved by formulating the process as an implicit ordinary differential equation (ODE).
Latent Diffusion Models (LDM)
LDMs further optimize the diffusion process by shifting it from high-dimensional image space to a lower-dimensional latent space:
- Use a pretrained Variational Autoencoder (VAE) to compress images.
- Perform the diffusion process in latent space, significantly reducing computational costs.
- Introduce conditioning (e.g., semantic maps, text, or images) during reverse denoising.
- Finally, use a decoder to reconstruct the image. While this method is faster, it is dependent on the quality of the VAE, which may result in some loss of fine details.
Stable Diffusion
Stable Diffusion is a specific implementation of latent diffusion models developed by Stability AI:
- It leverages CLIP (Contrastive Language–Image Pre-training) to extract text conditions that guide image generation, enabling strong multimodal capabilities.
- CLIP aligns images and captions into the same embedding space, allowing for controlled and semantically relevant generation.
Natural Language Processing (NLP)
Fundamentals
GPT (Generative Pre-trained Transformer):
GPT models predict the next token in a sequence (left-to-right) using only the transformer's decoder, making them well-suited for text generation. The pretraining objective is to model the probability of the next word given previous words.
BERT (Bidirectional Encoder Representations from Transformers):
BERT employs a bidirectional encoder, which allows it to consider both left and right context simultaneously. This makes it ideal for understanding tasks. Pretraining involves:
- Masked Language Modeling (MLM): Randomly mask tokens and predict them.
- Next Sentence Prediction (NSP): Determine if two sentences are contiguous.
T5 (Text-to-Text Transfer Transformer):
T5 uses both an encoder and a decoder, converting every task into a text-to-text format. Its pretraining task—Span Corruption—involves masking spans of text and training the model to recover them.
LLM Operators
Operator Fusion:
Fusing the attention computation softmax(Q·Kᵀ / √d)·V into a single kernel call avoids materializing intermediate results between separate kernel launches. Here, Q, K, and V have shape (n, d) (with n being the sequence length). In multi-head attention, d is divided by the number of heads.
Flash Attention:
Traditional attention mechanisms require storing an (n, n) activation matrix, leading to quadratic memory usage. Flash Attention optimizes this by:
- Dividing Q, K, and V into smaller blocks (e.g., a block size of 128).
- Computing local matrix multiplications and softmax operations within each block.
- Fusing these operations into a single GPU kernel, reducing memory usage from O(n²) to O(n).
For example, if Q, K, and V are of shape (4, 2), the standard approach computes a (4, 4) matrix, whereas flash attention splits them into (2, 2) blocks, processing them in chunks and combining the results—using techniques like the max-shift trick to maintain numerical stability.
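A NumPy sketch of the blockwise idea (ignoring the kernel fusion and I/O awareness that make the real algorithm fast) shows how per-block softmax statistics are combined with the max-shift trick; block size and shapes follow the example above:

```python
import numpy as np

def blockwise_attention(Q, K, V, block=2):
    """Numerically stable softmax(Q K^T / sqrt(d)) V, computed over K/V blocks."""
    n, d = Q.shape
    out = np.zeros_like(V)
    row_max = np.full((n, 1), -np.inf)     # running max of scores per query row
    row_sum = np.zeros((n, 1))             # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                           # (n, block)
        new_max = np.maximum(row_max, scores.max(axis=1, keepdims=True))
        correction = np.exp(row_max - new_max)                   # rescale old accumulators
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 2)) for _ in range(3))

# Reference: standard (full-matrix) attention
scores = Q @ K.T / np.sqrt(2)
scores -= scores.max(axis=1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(np.allclose(blockwise_attention(Q, K, V), weights @ V))   # True
```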
RMSNorm:
Normalizes activations by their root mean square, y = (x / sqrt(mean(x²) + ε)) · g, where g is a learnable scaling parameter.
Masked Softmax:
This operator integrates a mask into the softmax computation by adding a large negative value to the positions to be masked. It is particularly useful in attention mechanisms to prevent attending to future tokens (e.g., in GPT models); a minimal sketch follows below.
Rotary Position Embedding (RoPE):
In high-dimensional space, the vector is split into pairs (each forming a 2D plane), and each pair is rotated by an angle (which may be a linear or logarithmic function of the position). Notably, the property ⟨R_m·q, R_n·k⟩ = ⟨q, R_{n−m}·k⟩, which depends only on the relative offset n − m, helps the model learn relative positional information.
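The masked softmax mentioned above, in a minimal causal form where future positions receive a large negative value before the softmax (shapes and values are arbitrary):

```python
import numpy as np

def masked_softmax(scores, mask):
    """scores: (n, n) attention scores; mask: True where attention is allowed."""
    scores = np.where(mask, scores, -1e9)               # large negative value on masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.randn(n, n)
causal_mask = np.tril(np.ones((n, n), dtype=bool))      # token i may attend to tokens <= i
attn = masked_softmax(scores, causal_mask)
print(np.round(attn, 2))    # upper-triangular entries are ~0
```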
Transformer Architecture
The transformer is typically structured as follows:
- Encoder:
(words → embedding) + positional encoding →
[Multihead self-attention → Norm → Feedforward Network (with ReLU) + Residual Connection → Norm] repeated N times
- Decoder:
(words → embedding) + positional encoding →
[Masked Multihead self-attention (masking future tokens) + Norm → Cross-attention + Norm → Feedforward Network → Norm] repeated N times → Linear → Softmax
Why Self-Attention?
- RNNs process tokens sequentially and cannot be fully parallelized.
- CNNs can parallelize at the kernel level but require either many layers or large kernels to capture long-range dependencies, leading to high computational complexity.
- Self-attention enables direct interactions between all positions in a sequence and is highly parallelizable, especially when combined with multi-head attention.
LLama2 70B Training Example
For a 70-billion parameter model:
- Dataset: Approximately 2 trillion tokens (usually processed in one epoch).
- Sequence Length: Up to 4096 tokens.
- Global Batch Size: 2048 sequences.
- One Iteration Processes: 2048 × 4096 ≈ 8.4 million tokens.
- Training Example: With roughly 210 TFLOPS per GPU, 1024 GPUs, and a throughput of 215,040 tokens/s, processing 2 trillion tokens takes roughly 108 days (2e12 / 215,040 ≈ 9.3e6 seconds).
A sample configuration might be:
- DP (using ZeRO stage 1): 8
- TP: 16
- PP (chunk): 8
- Global batch size: 2048 → micro batch size: 16
Recommendation Systems
Job Recommendation Pipeline: A Technical Overview
In a typical job recommendation system, we often split the process into two main parts: recommendations for enterprises (B-end) and recommendations for job seekers (C-end). The C-end optimization objective is straightforward: to increase the average number of applications per user (which also correlates with user retention—more applications typically lead to higher retention).
In the fine-ranking stage (discussed below), the main metric is usually the “apply rate.” Data is collected by logging which recommended items (jobs) users clicked or didn’t click, and this feedback becomes the training data for the model. Because job markets and user behaviors evolve quickly, it’s common to retrain the model daily (or even more frequently for scenarios like news recommendations) to capture the latest signals. Major model improvements or architecture changes are typically evaluated using offline metrics before being deployed.
1. Recall
The recall phase generally retrieves a large pool of candidates (e.g., 10,000 jobs). A common strategy is to use the job title provided by the user as input for a search-based recall. This can leverage embedding-based retrieval so that related but differently phrased titles (e.g., “JavaScript Developer” vs. “React Developer”) can still be recognized as similar.
To enable this, you might fine-tune an embedding model on a taxonomy of job categories (e.g., Game -> MOBA). Then, when performing an embedding search (based on cosine similarity), you can pull up jobs whose descriptions or titles closely match the user’s query. Filters (like job location, required skill levels, etc.) are also commonly applied at this stage.
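In its simplest form, embedding-based recall is a cosine-similarity search between the query title embedding and precomputed job embeddings. A toy sketch, with random vectors standing in for real embeddings from a fine-tuned model:

```python
import numpy as np

rng = np.random.default_rng(0)
job_embeddings = rng.normal(size=(10_000, 128))       # precomputed job title/description vectors
job_embeddings /= np.linalg.norm(job_embeddings, axis=1, keepdims=True)

query = rng.normal(size=128)                          # embedding of the user's job title
query /= np.linalg.norm(query)

scores = job_embeddings @ query                       # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores)[:500]                     # candidate set passed to coarse ranking
print(top_k[:10], scores[top_k[:10]])
```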
2. Coarse Ranking
After the recall step, you typically have a large set of candidates (10,000). Coarse ranking narrows them down to a more manageable size (e.g., 500).
Often, this phase employs heuristic or rule-based strategies, such as favoring newer job postings or weighting more relevant titles more strongly. You can also incorporate factors like job popularity, how recently the job was posted, and various other metadata. The recall score itself can be one of the inputs. A straightforward method is to compute a composite score based on these features and then truncate the candidate list.
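A toy composite score blending the recall score, recency, and popularity might look like the following; the weights and decay constant are arbitrary assumptions:

```python
import numpy as np

def coarse_score(recall_score, days_since_posted, popularity):
    """Simple weighted blend used to truncate the candidate list."""
    freshness = np.exp(-days_since_posted / 14.0)      # newer postings score higher
    return 0.6 * recall_score + 0.25 * freshness + 0.15 * popularity

scores = coarse_score(
    recall_score=np.array([0.92, 0.80, 0.85]),
    days_since_posted=np.array([1, 30, 7]),
    popularity=np.array([0.3, 0.9, 0.5]),
)
top = np.argsort(-scores)[:2]                          # keep only the top candidates
print(scores, top)
```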
3. Fine Ranking
The fine-ranking stage is where a more sophisticated machine learning model (e.g., GBDT, LightGBM) comes into play. This model considers detailed item-level features—such as job category, skills required, and metadata—and user-level features, such as a user’s past clicks or applications, whether they are new or returning, their skill profile, and so on.
This stage may or may not further filter out candidates. The main goal is to precisely estimate the likelihood that a user will apply for a given job. Because these models rely on cross-feature interactions (e.g., whether a user’s skill set matches a job’s skill requirements), traditional tree-based models are often used. At the B-end (when recommending candidates to enterprises) it might be feasible to use more complex models like LLMs if the data size and latency constraints are different. However, for the C-end (job seekers), large-scale, real-time recommendation usually needs a more efficient model to handle high traffic.
4. Rerank
In some systems, there is an additional reranking step, which is often policy-driven. This might consider “freshness” (e.g., how new the job is) or external signals like Glassdoor ratings. In many cases, the logic in the coarse ranking and reranking steps can overlap, and some of these policy-based adjustments (or weighting rules) can be consolidated into one stage.
Ultimately, the entire pipeline—from recall to coarse ranking to fine ranking (and possibly reranking)—serves to strike a balance between personalization (making sure each user sees the jobs most relevant to them) and overall platform goals, such as showcasing newly posted or high-quality opportunities.
Open-Source Frameworks
Megatron
Developed by NVIDIA, Megatron is a framework for training large language models using PyTorch. It supports various transformer architectures (GPT, BERT, T5, etc.) and emphasizes:
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
DeepSpeed can also be combined with Megatron (a common approach in the industry), and frameworks like ColossalAI offer alternatives with a more active community.
MLflow
An open-source lifecycle platform from Databricks that provides:
- ML Tracking: Record and query experiments.
- MLflow Projects: Standardized ML project descriptions for reproducibility.
- MLflow Models: Universal model packaging.
- MLflow Model Registry: Versioning and review management.
vLLM
vLLM is an inference system optimized for deploying large language models with high throughput and low latency, particularly in streaming scenarios. Innovations include:
- PagedAttention: Efficiently manages KV cache by partitioning keys/values into pages.
- Continuous batching to merge concurrent inference requests.
CUTLASS
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA’s C++ template library designed to simplify and accelerate general matrix multiplication (GEMM) operations. It provides reusable and composable components that are highly optimized for CUDA.
System Design
ML System Design Principles
When designing an ML system:
- Define Requirements: Clearly specify inputs, outputs, and non-functional requirements (latency, availability, scalability).
- Document the Design: Use text and diagrams to record the system’s architecture.
- Three-Phase Approach:
- Data: Data gathering, preprocessing, and (if needed) scaling.
- Model: Training (using techniques like k-fold validation), evaluation, and (if needed) scaling.
- Deployment: Managing different environments (dev, test, prod), online A/B testing, scaling, and optimization (e.g., quantization).
- Feedback: Continuously interact with stakeholders to refine the design.
Inference Acceleration System
Problem: Design a machine learning inference acceleration system that guarantees low latency and high throughput, especially on resource-limited devices (e.g., mobile, edge devices, or overloaded servers).
Inputs:
- Model: Deep learning model (CNN, Transformer, recommendation model, etc.).
- Input Data: User behavior, image data, or text.
- Compute Resources: Available hardware (CPU, GPU, TPU, etc.).
Outputs:
- Inference Results: Predictions (e.g., recommended videos, image classifications, NLP responses).
- Performance Metrics: Latency, throughput (requests per second), model size, and accuracy.
Design Considerations:
- Low Latency:
- Compress the model using techniques like pruning (zeroing out unimportant weights based on magnitude, gradient, or activation) and post-training quantization (PTQ).
- For PTQ, export the model to ONNX, ensure all operators are supported, and build a TensorRT engine with proper calibration (using min-max or entropy calibration).
- High Throughput:
- Scale horizontally using container orchestration tools like Kubernetes.
- Implement caching (e.g., with Redis) for frequent requests.
- Utilize message queues (like Kafka) and load balancers to distribute incoming requests efficiently.
- For LLMs, consider frameworks like vLLM or DeepSpeed-Inference that optimize KV cache management through techniques such as PagedAttention.
Hardware Acceleration:
- GPU Clusters:
- For small, compressed models, a single GPU might suffice.
- For larger models (e.g., LLMs), distribute inference using tensor and pipeline parallelism. NVLink can enhance GPU-to-GPU communication.
- Alternative Accelerators:
- TPUs or specialized inference accelerators may be used based on workload characteristics.
- Supplementary Techniques:
- Knowledge distillation can further reduce model size while maintaining performance.
Scalability:
- Use Kubernetes to containerize the service.
- Leverage Horizontal Pod Autoscaling (HPA) and Cluster Autoscaler to dynamically add nodes as demand increases.
Handling Diverse Input Types:
- Time-Critical Requests:
- Deploy dedicated servers for low-latency tasks.
- Batch Processing:
- For less urgent data, add a preprocessing layer (e.g., converting images or text to embeddings) and queue the data for inference.
- Monitoring:
- Use performance monitoring tools like Prometheus and conduct stress testing and A/B testing to validate system performance.
Compiler Optimizations
Loop Unrolling:
Loop unrolling can accelerate code execution by:
- Reducing Loop Overhead: Minimizing jump instructions and loop control.
- Improving Branch Prediction: Lessening the chance of branch mispredictions.
- Enhancing Instruction-Level Parallelism: Allowing the compiler and CPU to better schedule and optimize instructions.
However, note that excessive unrolling can increase code size, potentially leading to instruction cache issues. Modern compilers often perform loop unrolling automatically, so manual intervention may yield only marginal improvements.