Summary: Large Scale Distributed Deep Networks

Paper

  • Presents DistBelief, a framework for parallel and distributed training of DNNs.
  • Introduces novel large-scale optimization methods (Downpour SGD, Sandblaster L-BFGS) that enable training models with billions of parameters.
  • Demonstrates how asynchronous and batch optimization procedures can effectively reduce training time and handle model sizes larger than what single-machine setups (including GPUs) can accommodate.
  • Proposes Downpour SGD, an asynchronous variant of stochastic gradient descent (SGD) that tolerates inconsistent, stale parameter updates yet still converges well on non-convex deep networks (see the Downpour SGD sketch after this list).
  • Implements Sandblaster L-BFGS, showing that a distributed batch (quasi-second-order) method can be made competitive on very large models (see the data-parallel L-BFGS sketch after this list).
  • Offers a parameter server architecture that supports both model parallelism and data parallelism, scaling to extremely large datasets and billions of parameters, whereas prior work often focused on smaller models or imposed synchronization constraints.
  • Evaluated a 5-layer deep network with ~42M parameters, trained on 1.1 billion frames of audio data for speech recognition, comparing single-replica SGD, GPU-based training, Downpour SGD with and without Adagrad (see the Adagrad sketch after this list), and Sandblaster L-BFGS.
  • Trained models with up to 1.7 billion parameters, showing that large-scale distributed training can surpass previous state-of-the-art results on the 21k-category ImageNet task.
  • Demonstrated how speedup scales with the number of machines (model partitions), especially for fully-connected vs. locally-connected models, to illustrate the efficiency of model parallelism (see the layer-partitioning sketch after this list).
  • Communication overhead becomes a bottleneck when the network is fully-connected or when partitioning exceeds an optimal point, limiting the potential speedup.
  • The asynchronous approach in Downpour SGD introduces extra stochasticity and lacks a robust theoretical guarantee for non-convex problems, although it works well in practice.
  • Sandblaster L-BFGS can require a large coordination overhead and is most beneficial with very large computing resources.
  • The methods require significant cluster infrastructure, which might not be accessible to all researchers or organizations.
  • Develop theoretical analyses for asynchronous non-convex optimization methods like Downpour SGD.
  • Investigate more sophisticated adaptive learning rate strategies that can further exploit distributed environments and handle large-scale data dynamics.
  • Expand the parameter server concept to unify various forms of parallelization (e.g., specialized hardware) to improve both speed and scalability.
  • Explore fault-tolerance enhancements, ensuring that large-scale distributed training can recover seamlessly from major node or network failures in a more automated fashion.
  • Convex Optimization: An optimization setting where the objective function is convex, ensuring that any local minimum is also a global minimum (see the convexity inequality after this list).
  • Non-convex Optimization: An optimization setting where the loss landscape can have multiple local minima and saddle points, typical in DNNs.
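
To make the Downpour SGD idea concrete, below is a minimal single-process sketch, not the paper's implementation: worker threads stand in for model replicas, a shared `ParameterServer` object stands in for the sharded parameter server, and each worker fetches possibly stale parameters, computes a gradient on its own data shard, and pushes the update asynchronously. The toy least-squares problem, the `ParameterServer`/`worker` names, and the learning rate are illustrative assumptions.

```python
import threading
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for DistBelief's sharded parameter server."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        # Replicas read a (possibly stale) copy of the current parameters.
        return self.w.copy()

    def push(self, grad):
        # Apply an asynchronous gradient step; no coordination with fetches.
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, data_shard, steps=500):
    """One model replica: repeatedly fetch parameters, compute a local
    gradient on its own data shard, and push the update."""
    rng = np.random.default_rng()
    for _ in range(steps):
        w = ps.fetch()                       # may already be out of date
        x, y = data_shard[rng.integers(len(data_shard))]
        grad = 2.0 * x * (x @ w - y)         # squared-error gradient, one example
        ps.push(grad)                        # update applied asynchronously

# Toy least-squares problem, split into one shard per replica.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
X = rng.normal(size=(400, 5))
y = X @ true_w
shards = [list(zip(X[i::4], y[i::4])) for i in range(4)]

ps = ParameterServer(dim=5)
threads = [threading.Thread(target=worker, args=(ps, shard)) for shard in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to true weights:", np.linalg.norm(ps.w - true_w))
```

The deliberate lack of locking between `fetch` and `push` is what makes the procedure asynchronous: replicas act on slightly outdated parameters, which is exactly the extra stochasticity noted in the limitations above.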
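The paper reports that pairing Downpour SGD with Adagrad's per-parameter adaptive learning rates substantially improved robustness, and because each rate depends only on that parameter's own gradient history, it can be computed locally on each parameter server shard. Below is a minimal sketch of the standard Adagrad rule; the class name and constants are my own.

```python
import numpy as np

class Adagrad:
    """Per-parameter adaptive learning rate:
    eta_i = gamma / sqrt(sum of squared gradients seen so far for parameter i)."""
    def __init__(self, dim, gamma=0.1, eps=1e-8):
        self.gamma = gamma
        self.eps = eps               # guards against division by zero
        self.accum = np.zeros(dim)   # running sum of squared gradients

    def step(self, w, grad):
        self.accum += grad ** 2
        return w - self.gamma * grad / (np.sqrt(self.accum) + self.eps)

# Usage inside a parameter-server shard (hypothetical): w = opt.step(w, received_grad)
```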
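Sandblaster L-BFGS keeps the optimizer state on a coordinator and farms each batch gradient out to many model replicas, each of which processes a shard of the data before the partial gradients are summed. The sketch below illustrates only that data-parallel gradient aggregation, using SciPy's single-machine L-BFGS-B in place of the paper's distributed implementation and a thread pool in place of a cluster; it omits the sharded parameter storage and the fine-grained work units used for load balancing.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.optimize import minimize

# Toy least-squares objective, with the data split into one shard per replica.
rng = np.random.default_rng(1)
true_w = rng.normal(size=5)
X = rng.normal(size=(1000, 5))
y = X @ true_w
shards = np.array_split(np.arange(len(X)), 4)

def shard_loss_and_grad(w, idx):
    """What a single model replica would compute on its shard of the batch."""
    r = X[idx] @ w - y[idx]
    return 0.5 * float(r @ r), X[idx].T @ r

def full_loss_and_grad(w, pool):
    """Coordinator step: farm the shards out, then sum the partial results."""
    parts = list(pool.map(lambda idx: shard_loss_and_grad(w, idx), shards))
    loss = sum(p[0] for p in parts)
    grad = np.sum([p[1] for p in parts], axis=0)
    return loss, grad

with ThreadPoolExecutor(max_workers=4) as pool:
    result = minimize(full_loss_and_grad, np.zeros(5), args=(pool,),
                      jac=True, method="L-BFGS-B")
print("distance to true weights:", np.linalg.norm(result.x - true_w))
```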
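Model parallelism in DistBelief assigns different neurons of the network to different machines, so only edges that cross a partition boundary require network communication; fully-connected layers cross partitions everywhere, which is why their speedup saturates sooner than that of locally-connected layers. A minimal illustration of the partitioning idea, with plain Python objects standing in for machines and function names of my own choosing:

```python
import numpy as np

def partition_layer(W, b, num_parts):
    """Split one fully-connected layer's output units across `num_parts`
    'machines' (here just tuples of weight slices)."""
    return list(zip(np.array_split(W, num_parts, axis=0),
                    np.array_split(b, num_parts)))

def forward(parts, x):
    """Each partition computes its slice of the activations; the concatenation
    step is where cross-machine communication would occur. A fully-connected
    layer needs all of x on every machine, a locally-connected one only a part."""
    return np.concatenate([np.maximum(0, W_p @ x + b_p) for W_p, b_p in parts])

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 4))
b = rng.normal(size=8)
x = rng.normal(size=4)

parts = partition_layer(W, b, num_parts=2)
assert np.allclose(forward(parts, x), np.maximum(0, W @ x + b))
```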
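For reference, the textbook convexity condition behind the first definition (not taken from the paper): a function f is convex iff, for all points x, y and all λ in [0, 1],

```latex
f\bigl(\lambda x + (1 - \lambda) y\bigr) \;\le\; \lambda f(x) + (1 - \lambda) f(y)
```

Deep-network losses violate this inequality, which is why the convergence of asynchronous methods like Downpour SGD is established empirically rather than by convex-optimization theory.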