
2024 Technical Notes

The content on this page was translated by O1.

How to reduce overfitting?

  • Use dropout
  • Use normalization
  • Use regularization (add a penalty term during training). Two approaches (see the sketch after this list):
    1. Add a penalty in the loss (e.g., $\lambda \lVert w \rVert_2^2$ for L2, or $\lambda \lVert w \rVert_1$ for L1).
    2. Directly apply weight decay (i.e., multiply the weights by $1 - \eta\lambda$ at each update).
  • Increase the data size, use early stopping, etc.
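
A minimal NumPy sketch of the two regularization approaches from the list above; the toy loss and function names are made up, and for plain SGD the two updates coincide:

```py
import numpy as np

def sgd_step_l2_penalty(w, grad_loss, lr=0.1, lam=0.01):
    # Approach 1: add (lam / 2) * ||w||^2 to the loss; its gradient is lam * w.
    return w - lr * (grad_loss(w) + lam * w)

def sgd_step_weight_decay(w, grad_loss, lr=0.1, lam=0.01):
    # Approach 2: plain gradient step, then shrink the weights directly,
    # i.e. multiply them by (1 - lr * lam).
    return (1 - lr * lam) * w - lr * grad_loss(w)

w = np.array([1.0, -2.0])
grad_loss = lambda w: 2 * w          # gradient of ||w||^2, used as a stand-in loss
print(sgd_step_l2_penalty(w, grad_loss))
print(sgd_step_weight_decay(w, grad_loss))   # same result for plain SGD
```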

Why can’t ZeRO train/update parameters directly in half precision (16-bit)?
Because gradient updates need high precision; otherwise, small gradients may vanish (which is why an FP32 master copy of the weights is usually kept).

Leaky ReLU: introduces a small negative slope so that it won’t produce zero derivatives for negative inputs:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}$$

where $\alpha$ is a small constant (e.g., 0.01).
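
A one-line NumPy version for reference (the function name is mine):

```py
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x >= 0, alpha * x otherwise; the gradient for x < 0 is alpha, not 0.
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]
```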

Given $Y = XW$, the gradients are $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^{\top}$ and $\frac{\partial L}{\partial W} = X^{\top} \frac{\partial L}{\partial Y}$.

Why? One way to see it is to consider a small scalar case or do index-wise expansion.

By analogy, for convolution, the gradient w.r.t. the input is basically the “reverse convolution” of the gradients. You can derive it similarly by looking at small scalar examples.
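
A quick NumPy check of the $Y = XW$ gradients above, comparing the analytic $\frac{\partial L}{\partial X}$ against a finite difference (the toy loss is made up):

```py
import numpy as np

rng = np.random.default_rng(0)
X, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 5))
G = rng.normal(size=(4, 5))              # pretend dL/dY comes from upstream

# Analytic gradients of L = sum(G * (X @ W))
dX = G @ W.T                             # dL/dX = dL/dY @ W^T
dW = X.T @ G                             # dL/dW = X^T @ dL/dY

# Finite-difference check on one entry of X
eps = 1e-6
Xp = X.copy(); Xp[0, 0] += eps
numeric = (np.sum(G * (Xp @ W)) - np.sum(G * (X @ W))) / eps
print(np.isclose(numeric, dX[0, 0], atol=1e-4))   # True
```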

LoRA updates the weight matrix without altering the original weights. The updated weight can be expressed as

$$W' = W + BA,$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are low-rank matrices with $r \ll \min(d, k)$. LoRA optimizes $B$ and $A$ while keeping the original $W$ fixed. It allows rapid fine-tuning on specific tasks without sacrificing the performance of the original model.
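
A minimal NumPy sketch of a LoRA forward pass, assuming the common $\alpha / r$ scaling; shapes and names are illustrative:

```py
import numpy as np

d, k, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))              # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01       # trainable, low rank
B = np.zeros((d, r))                     # trainable, zero-initialized so W' = W at start

def lora_forward(x):
    # Effective weight W' = W + (alpha / r) * B @ A, applied without modifying W.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, k))
print(lora_forward(x).shape)             # (2, 64)
```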

Why use normalization? It keeps each layer’s outputs in a relatively stable distribution (e.g., mean 0, variance 1), which helps the activation function. For example, if after BN we apply ReLU, normalizing helps avoid large numbers of negative values and speeds up convergence while mitigating overfitting.

Consider an input with shape NCHW:

  • Batch Norm (BN): normalizes across the batch dimension (N) plus H and W. Subtract mean, divide by standard deviation.
  • Layer Norm (LN): normalizes across a single sample’s feature dimension (C) along with H and W.
  • Group Norm (GN): splits channels into G groups, normalizes each group, then concatenates.
  • Instance Norm (IN): normalizes only over H and W for each sample/channel.
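
The four variants differ only in which axes of the NCHW tensor the statistics are computed over; a small NumPy sketch (the group count G = 2 is arbitrary):

```py
import numpy as np

x = np.random.randn(8, 4, 16, 16)        # N, C, H, W

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

bn = normalize(x, (0, 2, 3))             # BN: over N, H, W for each channel
ln = normalize(x, (1, 2, 3))             # LN: over C, H, W for each sample
inorm = normalize(x, (2, 3))             # IN: over H, W for each sample & channel

G = 2                                    # GN: split C into G groups, normalize per group
xg = x.reshape(8, G, 4 // G, 16, 16)
gn = normalize(xg, (2, 3, 4)).reshape(x.shape)
```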

A reference NumPy implementation of the Adam optimizer:

```py
import numpy as np

def adam_optimizer(grad, params, learning_rate=0.001, beta1=0.9, beta2=0.999,
                   epsilon=1e-8, num_iterations=10):
    # First- and second-moment estimates, same shape as the parameters.
    m = np.zeros_like(params)
    v = np.zeros_like(params)

    for t in range(1, num_iterations + 1):
        g = grad(params)                     # gradient at the current parameters
        m = beta1 * m + (1 - beta1) * g      # EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g  # EMA of squared gradients

        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        params = params - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

    return params
```
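
Usage sketch: minimizing a simple quadratic with the function above (the target values are arbitrary):

```py
import numpy as np

target = np.array([3.0, -1.0])
grad = lambda p: 2 * (p - target)        # gradient of ||p - target||^2

p0 = np.zeros(2)
print(adam_optimizer(grad, p0, learning_rate=0.1, num_iterations=1000))
# converges toward [ 3. -1.]
```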

Common loss functions:

  • L2 Loss: Mean Squared Error (MSE), $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

  • L1 Loss: Mean Absolute Error (MAE), $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$.

  • Huber Loss: uses L2 loss for small errors, L1 loss for large errors.

  • Example (binary cross entropy):

    ```
    Sample    True Label (y)    Model Pred Prob (p)
    1              1                  0.9
    2              0                  0.2
    3              1                  0.6
    ```

    The loss would be $-\frac{1}{3}\left[\ln 0.9 + \ln(1 - 0.2) + \ln 0.6\right] \approx 0.28$.

  • Cross-Entropy Loss: $L = -\sum_i y_i \log p_i$. Here $p_i$ is the softmax probability of the $i$-th class.

In practice, for binary classification we usually combine a sigmoid with BCE, and for multi-class tasks we often combine softmax with cross entropy.
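
Reproducing the example above in NumPy, plus a small softmax cross-entropy helper for the multi-class case (helper names are mine):

```py
import numpy as np

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])

bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(round(bce, 2))                     # 0.28

def softmax_cross_entropy(logits, label):
    # label is the index of the true class
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

print(softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), 0))
```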


Perplexity measures the uncertainty in predicting the next token. A perfectly correct model would have a perplexity of 1. Higher perplexity indicates higher uncertainty.
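
Concretely, perplexity is the exponential of the average negative log-likelihood of the tokens; a tiny NumPy sketch with made-up next-token probabilities:

```py
import numpy as np

token_probs = np.array([0.5, 0.25, 0.8, 0.1])    # p(next token) assigned by the model
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)        # > 1 here; it equals 1 only if every probability were 1.0
```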

Essence of Attention: “lookup” the context that best helps you predict the output.

Transformer is composed of an Encoder and a Decoder:

  • The Encoder processes the input data and produces hidden states.
  • The Decoder transforms the hidden states into the target sequence.

Positional encoding: uses sine and cosine functions (or an embedding) to encode position info, e.g., $PE_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big)$ and $PE_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big)$. Another way is a learnable position embedding.
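
A NumPy sketch of the sinusoidal encoding above (even dimensions get sin, odd dimensions get cos; the function name is mine):

```py
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dims
    pe[:, 1::2] = np.cos(angles)                         # odd dims
    return pe

print(sinusoidal_positional_encoding(50, 16).shape)      # (50, 16)
```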

Within each Encoder/Decoder block, we have:

  • Multi-head self-attention
  • Feed Forward Network (two linear layers + activation like ReLU or GELU)
  • Layer Normalization
  • Residual connections

The attention weights are computed as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(\frac{QK^\top}{\sqrt{d_k}}\big)V$ ($d_k$ is the hidden size per head; the $\frac{1}{\sqrt{d_k}}$ factor is a scaling to avoid overly large dot products).

Cross-attention (in the decoder) uses Q from the decoder’s hidden state and K, V from the encoder output.

There is also masked self-attention in the decoder to block future tokens during training for autoregressive tasks.
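
A NumPy sketch of single-head scaled dot-product attention with an optional causal mask (the kind used in the decoder’s masked self-attention); names are illustrative:

```py
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (len_q, len_k)
    if causal:
        # Block attention to future positions (strict upper triangle).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, d_k = 8
print(attention(x, x, x, causal=True).shape)         # (5, 8)
```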

BERT typically uses two loss components:

  1. Masked Language Model (MLM)
  2. Next Sentence Prediction (NSP)

How to get a pre-trained embedding?

Previously, Word2Vec was popular (CBOW or Skip-gram). Now, BERT-style embeddings are more common:

  • WordPiece tokenization (e.g., “un”, “##happi”, “##ness”) handles unknown words better than classic word-level embedding.
  • BERT obtains a context-dependent vector for each token.
  • BERT uses an embedding matrix (30,000 tokens), plus positional embeddings, plus segment embeddings, then feeds into the Transformer.
  • It is pre-trained on large-scale data using MLM + NSP.

GPT differs from BERT in that it looks only at the left context, training to predict the next token. This makes GPT better for generative tasks.

Why use Layer Norm instead of Batch Norm?
Batch sizes (N) can be variable in sequence tasks. Layer Norm normalizes along feature dimensions within one sample, which does not depend on batch size.

Why is there a $\frac{1}{\sqrt{d_k}}$ in the attention formula?
It acts as a scaling factor. If we didn’t scale, the softmax might become too “sharp.” The $\sqrt{d_k}$ factor arises because, if Q and K have variance 1, their dot product’s variance grows with the dimension $d_k$.

Hugging Face: a widely-used unified API for tokenizers and model inference.
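
A minimal sketch of that unified API, assuming the `transformers` package is installed and using `bert-base-uncased` as an example checkpoint (weights are downloaded on first use):

```py
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, hidden_size)
```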

T5: Text-to-Text Transfer Transformer. It converts all NLP tasks into text-to-text format.

RoBERTa: removes Next Sentence Prediction and focuses more on MLM, among other training optimizations.


ResNet-50: a residual network with 50 layers, used for classification.

Key points:

  1. Residual blocks: skip connections mitigate gradient vanishing.
  2. Each block has CNN + BN + ReLU.
  3. Input size: $3 \times 224 \times 224$. Output: class probabilities (1000 classes).

Structure roughly:

```
Image (3*224*224)
-> Conv1 -> 64*112*112
-> Conv2 -> 256*56*56
-> Conv3 -> 512*28*28
-> Conv4 -> 1024*14*14
-> Conv5 -> 2048*7*7
-> Avg Pool -> 2048*1*1
-> FNN + Softmax -> 1000 classes
```
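
A sketch of one bottleneck residual block in PyTorch (assuming `torch` is available; the shapes in the usage line match the Conv3 stage above, and this is illustrative rather than the full torchvision implementation):

```py
import torch
from torch import nn

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 -> 1x1 expand (4x), each with BN, plus a skip connection.
    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * 4
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes, identity otherwise.
        self.shortcut = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch))
                         if stride != 1 or in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

print(Bottleneck(256, 128, stride=2)(torch.randn(1, 256, 56, 56)).shape)  # (1, 512, 28, 28)
```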

In diffusion models (e.g., DDPM), the noise $\epsilon_\theta(x_t, t)$ is predicted by a neural network.

Loss: MSE between the predicted noise $\epsilon_\theta(x_t, t)$ and the actual noise $\epsilon$.
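
A sketch of that objective under the standard DDPM formulation (NumPy; `eps_pred` is a stand-in for the U-Net $\epsilon_\theta$, and the noise-schedule values are typical defaults, not from this note):

```py
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 3, 32, 32))          # a batch of clean images
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)           # cumulative product \bar{alpha}_t

t = rng.integers(0, T, size=8)                # random timestep per sample
eps = rng.normal(size=x0.shape)               # the actual noise
ab = alpha_bar[t][:, None, None, None]
x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps   # noised sample

eps_pred = np.zeros_like(eps)                 # stand-in for eps_theta(x_t, t)
loss = np.mean((eps - eps_pred) ** 2)         # MSE between actual and predicted noise
print(loss)
```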

  • FID (Fréchet Inception Distance): measures distribution similarity between generated images and real images. Lower is better.
  • PSNR (Peak Signal-to-Noise Ratio): focuses on pixel-level difference. Higher is better, but may not reflect perceptual quality well.
  • SSIM (Structural Similarity Index Measure): focuses on structural similarity (luminance, contrast, structure). Higher is better.
  • IS (Inception Score): uses a pretrained classifier (e.g., Inception) to evaluate realism (how confident the classifier is) and diversity (distribution spread). Higher is better.
  • UnetGenerator: Convolution + BatchNorm + ReLU in an encoder-decoder structure that shrinks down to a bottleneck, then upsamples. Skip connections (concatenate encoder outputs) preserve spatial info.
  • PatchDiscriminator: outputs a probability map. It slides over the image in patches, focusing on local realism, and reduces parameter count.

Swin Transformer introduces Transformers to CV:

  • Split the image into non-overlapping windows to compute self-attention locally.
  • To enhance cross-window information, Swin uses shifted windows in consecutive layers so patches can interact with adjacent regions.

CLIP (Contrastive Language–Image Pre-training): a multimodal model that maps images and text into a shared semantic space. It is trained via contrastive learning so that matching image-text pairs have high similarity, while non-matching pairs have low similarity.
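
A NumPy sketch of the symmetric contrastive (InfoNCE-style) objective used in CLIP-style training; the random features, batch size, and temperature 0.07 are placeholders for real encoder outputs:

```py
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(4, 64)))     # image embeddings (batch of 4)
txt = l2_normalize(rng.normal(size=(4, 64)))     # matching text embeddings, same order

logits = img @ txt.T / 0.07                      # cosine similarities / temperature
labels = np.arange(4)                            # the i-th image matches the i-th text

# Cross entropy in both directions (image->text and text->image), then average.
loss_i = -log_softmax(logits, axis=1)[labels, labels].mean()
loss_t = -log_softmax(logits.T, axis=1)[labels, labels].mean()
print((loss_i + loss_t) / 2)
```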


asyncio coroutines in Python.

Coroutines implement a single-threaded concurrency model. An event loop schedules multiple coroutines. During I/O wait, the loop switches to another coroutine. It’s ideal for I/O-bound tasks, especially asynchronous network I/O (non-blocking sockets, event notifications, etc.). File I/O, however, might still be blocking on many operating systems, though newer Linux kernels have io_uring and AIO to support asynchronous file operations.

Note that for CPU-bound tasks, asyncio may not help since it’s still single-threaded.
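
A small example: three coroutines that each wait on simulated I/O (`asyncio.sleep` stands in for a non-blocking network call) run concurrently in one thread, so the total time is about 1 second instead of 3:

```py
import asyncio
import time

async def fetch(name):
    await asyncio.sleep(1)            # stand-in for a non-blocking network call
    return f"{name} done"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(fetch("a"), fetch("b"), fetch("c"))
    print(results, f"{time.perf_counter() - start:.1f}s")   # ~1.0s total

asyncio.run(main())
```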


For some tree-based structures, you may prefer your own implementation over llama-index. For example, HierarchicalNodeParser splits a document and returns a flat list of nodes. Internally, you might need get_deeper_nodes to retrieve the nodes at a particular level; under the hood, it just iterates the list in a certain order. If you want to retrieve nodes from a certain tree level (like sentences) directly, you can do it in $O(1)$ if you maintain a dict of nodes keyed by level.
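
A sketch of the dict-of-nodes-by-level idea in plain Python (the `Node` class here is hypothetical, not the llama-index one):

```py
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    level: int                      # e.g. 0 = document, 1 = section, 2 = sentence
    children: list = field(default_factory=list)

def index_by_level(root):
    # One traversal builds {level: [nodes]}, so later lookups are O(1) per level.
    by_level, stack = defaultdict(list), [root]
    while stack:
        node = stack.pop()
        by_level[node.level].append(node)
        stack.extend(node.children)
    return by_level

doc = Node("doc", 0, [Node("sec", 1, [Node("sentence 1", 2), Node("sentence 2", 2)])])
print([n.text for n in index_by_level(doc)[2]])
```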

Common evaluation metrics for RAG:

  • Accuracy, Precision, Recall, F1
  • Mean Reciprocal Rank (MRR): focuses on the position of the first relevant result; closer to 1 is better.
  • Normalized Discounted Cumulative Gain (nDCG): considers ranked lists and relevance scores. Values closer to 1 are better.
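
Minimal NumPy implementations for reference (binary relevance assumed; the function names are mine):

```py
import numpy as np

def mrr(ranked_relevance_lists):
    # Each inner list marks relevance (1/0) of results in ranked order, one list per query.
    rr = []
    for rel in ranked_relevance_lists:
        hits = np.flatnonzero(rel)
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

def ndcg(rel, k=None):
    rel = np.asarray(rel, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum(rel * discounts)
    ideal = np.sum(np.sort(rel)[::-1] * discounts)
    return float(dcg / ideal) if ideal > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))       # (1/2 + 1/1) / 2 = 0.75
print(ndcg([0, 1, 1], k=3))
```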

Shared memory does not overflow silently; an overflow will raise errors.

Modern hardware usually links memory usage to speed. Saving memory often saves time.

In CUDA, the typical model is that each thread handles a single element of the computation. Many blocks are scheduled by the hardware. Instead of a for loop in one thread, you launch thousands of threads, each doing a small piece of work.


  • Linear Regression: fit a line/plane for input-output relationships.

  • Logistic Regression: pass a linear model through a logistic (sigmoid) function for binary classification.

  • Support Vector Machine (SVM): use different kernels (linear, polynomial, Gaussian) to map data into a high-dimensional space, then find a linear separating hyperplane.

  • K-Nearest Neighbors (KNN): find the $k$ nearest points, use majority vote or average to predict.

  • Decision Tree: iteratively split on feature conditions.

  • Random Forest: an ensemble of decision trees.

  • Naive Bayes: assumes independence among features (not always realistic).

  • K-Means Clustering: an unsupervised method that groups data into clusters by minimizing within-cluster variance.

  • Gradient Boosted Decision Trees (GBDT): builds decision trees in sequence, each new tree fitting the residual (negative gradient) from the previous model.

    • XGBoost is a common implementation.
    • LightGBM uses histogram-based algorithms, leaf-wise growth, plus GOSS (focuses on large gradients) for efficiency.
    • CatBoost has built-in support for categorical features.
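
A minimal sketch of the GBDT idea for squared loss, where each new shallow tree fits the current residuals (assumes scikit-learn is available; real libraries such as XGBoost/LightGBM add regularization, histogram splits, etc.):

```py
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

lr, n_trees, trees = 0.1, 100, []
pred = np.full_like(y, y.mean())                 # F_0: start from the mean

for _ in range(n_trees):
    residual = y - pred                          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)                 # F_m = F_{m-1} + lr * new tree
    trees.append(tree)

print(np.mean((y - pred) ** 2))                  # training MSE shrinks as trees are added
```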