Summary: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Paper: https://arxiv.org/abs/1910.02054

Note: Also read distrubuted_training_strategy if you are interested in this topic.

Key Contributions

  • The paper introduces ZeRO (Zero Redundancy Optimizer), a set of memory optimizations for extremely large-scale neural-network training (potentially up to trillions of parameters).

  • ZeRO-DP partitions the optimizer states, gradients, and parameters across data-parallel devices, removing redundant copies and enabling larger batch sizes and higher training throughput (see the memory sketch after this list).

  • It also proposes ZeRO-R to handle activation memory, temporary buffers, and memory fragmentation, further reducing the memory footprint while retaining computational efficiency.

  • Unlike traditional DP (which replicates the full model on every device) or tensor parallelism (TP, which splits individual layers vertically across devices), ZeRO partitions the model’s states and re-materializes them only when needed.

  • ZeRO’s approach retains the simplicity and low communication overhead of DP while drastically reducing memory overhead—something not achieved by prior TP or pipeline-parallel (PP) techniques.

  • It shows analytically that ZeRO’s partitioning strategy could fit a trillion-parameter model given enough GPUs, and validates the approach experimentally at the 100B+ parameter scale.

  • ZeRO does not require complicated model refactoring, whereas prior solutions such as Megatron-LM or GPipe often need significant changes to the model architecture or training loop.
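
A minimal sketch of the paper’s memory arithmetic for mixed-precision Adam training (the 2 + 2 + 12 bytes-per-parameter breakdown and the per-stage formulas come from the paper; the function itself is illustrative):

```python
def model_state_gb(psi, nd, stage=0, k=12):
    """Per-GPU model-state memory (GB, 1e9 bytes) for mixed-precision Adam.

    psi   -- number of model parameters
    nd    -- data-parallel degree (devices sharing each partition)
    stage -- 0: baseline DP, 1: P_os, 2: P_os+g, 3: P_os+g+p
    k     -- optimizer bytes per parameter (12 for Adam: an fp32 copy of
             the parameters plus momentum and variance)
    """
    params, grads, opt = 2 * psi, 2 * psi, k * psi  # fp16 params and grads
    if stage >= 1:
        opt /= nd       # stage 1: partition optimizer states
    if stage >= 2:
        grads /= nd     # stage 2: also partition gradients
    if stage >= 3:
        params /= nd    # stage 3: also partition parameters
    return (params + grads + opt) / 1e9

# The paper's running example: 7.5B parameters, 64-way data parallelism.
for s in range(4):
    print(f"stage {s}: {model_state_gb(7.5e9, 64, s):.1f} GB per GPU")
# -> 120.0, 31.4, 16.6, and 1.9 GB, matching the paper's figures
```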

Experiments and Results

  • The authors trained GPT-2–style transformer models ranging from 1.5B to 170B parameters on hundreds of V100 GPUs.

  • They compared ZeRO against baseline DP (PyTorch DDP) and a SOTA TP system (Megatron-LM) to show throughput (TFLOPS) improvements and memory savings.

  • Experiments scaling a 60B-parameter model from 64 to 400 GPUs demonstrated “super-linear” speedups as the DP degree increased, because each GPU could fit a larger batch (see the rough illustration after this list).

  • Detailed measurements of GPU memory usage were provided, showing how partitioning optimizer states, gradients, and parameters significantly reduces per-GPU requirements.

  • The authors trained a 17B-parameter language model (Turing-NLG) that achieved a new SOTA (WikiText-103 perplexity of 10.21), illustrating real-world applicability.
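
A rough illustration of why the speedup can be super-linear (the 16-way model parallelism below is an assumption for illustration, not a configuration taken from the paper): with optimizer-state partitioning, per-GPU optimizer memory shrinks as the DP degree grows, freeing room for larger per-GPU batches and higher arithmetic intensity.

```python
# 60B parameters split 16 ways by model parallelism (assumed), with the
# remaining scale-out coming from data parallelism.
psi_per_gpu = 60e9 / 16
for gpus in (64, 128, 400):
    nd = gpus // 16                          # data-parallel degree
    opt_gb = 12 * psi_per_gpu / nd / 1e9     # Adam states under P_os
    print(f"{gpus} GPUs (DP={nd}): optimizer states ~{opt_gb:.1f} GB/GPU")
# -> ~11.2, ~5.6, and ~1.8 GB: the freed memory can hold bigger batches
```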

Limitations

  • Although ZeRO can fit a trillion-parameter model in terms of memory, training it end-to-end would still take an impractically long time on 2020-era hardware (potentially months or more).

  • Stage 3 (parameter partitioning) carries a 1.5× communication volume relative to standard DP (see the arithmetic after this list); that extra cost can be non-trivial in scenarios with limited interconnect bandwidth.

  • Properly tuning activation checkpoint partitioning (e.g., deciding when to offload to CPU) may require domain-specific heuristics and can introduce additional overhead if not carefully managed.

  • Achieving the advertised efficiency sometimes requires many GPUs with specific high-bandwidth interconnects (e.g., NVSwitch within a node, high-speed inter-node links).

  • Although transformers are a dominant architecture, the paper does not deeply explore how ZeRO might handle other model types with different memory patterns.
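
The 1.5× figure follows from the paper’s communication analysis; a back-of-the-envelope version (volumes in units of the model size Ψ):

```python
PSI = 1.0  # per-step communication volume, in units of model size

# Standard DP: one gradient all-reduce per step. A ring all-reduce is
# effectively a reduce-scatter followed by an all-gather, ~PSI each.
dp = PSI + PSI

# ZeRO stage 3: reduce-scatter the gradients (PSI), plus an all-gather of
# the partitioned parameters before the forward pass and again before the
# backward pass (PSI each).
zero3 = PSI + PSI + PSI

print(zero3 / dp)  # -> 1.5
```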

Future Directions

  • Investigate more strategies for dynamically offloading activations and model states, taking into account heterogeneous memory tiers (e.g., GPU HBM, CPU DRAM, NVMe).

  • Develop a system that automatically selects the best partitioning schedule or CPU-offload policy based on real-time memory usage, communication bandwidth, and arithmetic intensity (a toy cost model is sketched after this list).

  • Extend ZeRO’s partitioning approach and memory optimizations to convolution-based architectures, graph NNs, and emerging large-scale models.

  • As compute clusters grow, explore how ZeRO’s techniques scale in exascale HPC environments with tens of thousands of GPUs, potentially refining communication collectives.

  • Adapt ZeRO for downstream tasks that require partial model updates or parameter-efficient methods (e.g., LoRA, adapters), ensuring that memory is minimized for both pre-training and fine-tuning stages.
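
One way such an offload policy could be framed (a purely hypothetical sketch, not anything proposed in the paper): when an activation must be evicted from GPU memory, compare the time cost of round-tripping it to CPU against recomputing it in the backward pass.

```python
def offload_or_recompute(act_bytes, pcie_gb_s, recompute_flops, gpu_tflops):
    """Hypothetical policy: return the cheaper way to free an activation."""
    offload_s = 2 * act_bytes / (pcie_gb_s * 1e9)        # copy out + copy back
    recompute_s = recompute_flops / (gpu_tflops * 1e12)  # redo forward work
    if offload_s < recompute_s:
        return "offload", offload_s
    return "recompute", recompute_s

# e.g., a 2 GB activation over 16 GB/s PCIe vs ~1.2 TFLOPs of recompute
# on a 100 TFLOPS GPU: recomputing (~12 ms) beats offloading (~250 ms).
print(offload_or_recompute(2e9, 16, 1.2e12, 100))
```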

Terminology

  • GPU HBM (High-Bandwidth Memory): A type of high-speed, on-package memory used by modern GPUs, providing very high bandwidth and low power consumption compared to traditional GDDR memory.
  • NVMe (Non-Volatile Memory Express): A high-performance storage interface protocol for solid-state drives, designed to reduce latency and improve input/output (I/O) operations.
  • NVSwitch: A fully-connected, high-speed switch architecture by NVIDIA that allows GPUs in the same server (e.g., DGX-2) to communicate at very high bandwidth and low latency.
  • InfiniBand EDR (Enhanced Data Rate): A network interconnect technology providing high bandwidth and low latency, often used for HPC clusters and GPU communication across nodes.
  • Intra-Node vs. Inter-Node: Intra-node refers to devices (e.g., GPUs) within the same physical machine; inter-node refers to devices across different physical machines in a cluster.
  • DGX-2: An NVIDIA system that packs 16 V100 GPUs with NVSwitch for high-bandwidth, all-to-all GPU communication.