Summary: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
What is the paper about?
This paper introduces an integer-only quantization scheme for neural networks.
During training, the model simulates quantization (fake quantization) so that, at inference time, weights and activations can be processed as 8-bit integers. This leads to significant speedups and memory savings on common mobile hardware, while maintaining accuracy close to the floating-point baseline.
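As a rough illustration of fake quantization (a NumPy sketch, not the authors' TensorFlow implementation), the forward pass clamps values to an estimated [min, max] range and snaps them to one of 256 uniformly spaced levels, while every tensor stays in floating point:

```python
import numpy as np

def fake_quantize(x, x_min, x_max, num_bits=8):
    """Simulate quantization in the forward pass while staying in floats.

    Values are clamped to [x_min, x_max], snapped to the nearest of
    2**num_bits uniformly spaced levels, and mapped back to real values.
    """
    levels = 2 ** num_bits - 1                  # 255 steps for 8 bits
    scale = (x_max - x_min) / levels            # real-valued step size
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q * scale + x_min                    # "quantized" floats

# During training the backward pass treats this op as the identity
# (straight-through estimator), so gradients flow through unchanged.
w = np.random.randn(5).astype(np.float32)
print(fake_quantize(w, float(w.min()), float(w.max())))
```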
What is new about this specific paper, compared to prior work?
- Both weights and activations are quantized to 8 bits, making inference fully integer-based with minimal floating-point involvement.
- While many past works focus on compression or theoretical speed gains, this paper provides real device benchmarks on ARM CPUs (Qualcomm Snapdragon cores), demonstrating actual latency improvements.
- The paper shows that even already-optimized networks (e.g., MobileNets) benefit further from this quantization, pushing the speed-accuracy boundary.
- It details how to simulate and fold batch normalization during training so that integer inference stays accurate, a point not commonly addressed in earlier quantization research (see the folding sketch after this list).
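To make the batch-normalization point concrete, here is a minimal sketch of inference-time folding, assuming a conv layer with per-output-channel BN statistics (variable names and the eps default are illustrative, not taken from the paper):

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding conv layer.

    BN(w*x + b) = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
    is rewritten as w_fold*x + b_fold, so only the folded weights need
    fake quantization during training and no BN op remains at inference.

    Shapes: w is (out_ch, in_ch, kh, kw); b, gamma, beta, mean, var are (out_ch,).
    """
    inv_std = gamma / np.sqrt(var + eps)
    w_fold = w * inv_std[:, None, None, None]   # scale each output channel
    b_fold = (b - mean) * inv_std + beta
    return w_fold, b_fold
```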
What experiments were run to support the arguments in this paper?
- ResNets (various depths), Inception v3, and MobileNets were trained with quantization, showing only small accuracy drops under integer-only inference.
- MobileNet SSD models were tested with 8-bit quantization, showing up to 50% latency reductions while preserving most of the detection performance.
- Face detection & attributes: experiments on face datasets demonstrated close to a 2× speedup on real hardware with minimal accuracy impact.
- Ablation studies varied the bit widths of weights and activations separately, revealing the trade-offs between lower precision and accuracy (a toy illustration of this effect follows this list).
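As a self-contained toy check (my own illustration, not an experiment from the paper), one can measure how much uniform quantization distorts a tensor at each bit width; the error roughly doubles for every bit removed:

```python
import numpy as np

def quantization_rmse(x, num_bits):
    """RMSE between a tensor and its uniformly quantized reconstruction."""
    levels = 2 ** num_bits - 1
    scale = (x.max() - x.min()) / levels
    x_hat = np.round((x - x.min()) / scale) * scale + x.min()
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)                   # stand-in for a weight tensor
for bits in (8, 6, 4, 2):
    print(f"{bits}-bit RMSE: {quantization_rmse(w, bits):.4f}")
```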
What are the shortcomings/limitations of this paper?
- The work does not extensively investigate more aggressive (e.g., 4-bit or 2-bit) quantization, where the accuracy drop might be higher but the efficiency gains greater.
- Although the paper covers several popular networks, further validation would be needed for other models (e.g., Transformer-based or very large-scale networks).
- The paper mainly evaluates ARM NEON-based optimizations; the gains from integer arithmetic may differ on other hardware, such as GPUs or FPGAs.
- Introducing fake quantization and batch-normalization folding adds complexity to training, requiring extra steps such as activation-range estimation and delaying the start of activation quantization (see the range-tracking sketch after this list).
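For context, the activation ranges mentioned above are estimated during training with an exponential moving average (EMA) of the observed per-batch min and max; a minimal sketch (the class name and momentum value are my own choices):

```python
import numpy as np

class EmaRange:
    """Track activation min/max with an exponential moving average,
    smoothing out rapid shifts in the activation distribution."""

    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.min = None
        self.max = None

    def update(self, x):
        batch_min, batch_max = float(x.min()), float(x.max())
        if self.min is None:                    # first batch initializes
            self.min, self.max = batch_min, batch_max
        else:
            m = self.momentum
            self.min = m * self.min + (1 - m) * batch_min
            self.max = m * self.max + (1 - m) * batch_max
        return self.min, self.max

# Usage: feed each batch's activations, then quantize with the smoothed range.
tracker = EmaRange()
for _ in range(3):
    lo, hi = tracker.update(np.random.randn(128))
```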
What is a reasonable next step to build upon this paper?
- Investigate whether 4-bit or mixed-precision schemes can maintain comparable accuracy while achieving even greater speedups.
- Validate how integer-only inference performs on diverse platforms (e.g., edge GPUs, DSPs, or microcontrollers) and optimize code paths accordingly.
- Extend this integer quantization approach to NLP models (Transformers), sequence data, or more complex multi-modal architectures.
- Develop more advanced or dynamic quantization range techniques during training to handle rapid distribution shifts and further reduce quantization error.
Appendix
- The most interesting part of Section 2.2 (integer-arithmetic-only matrix multiplication) is how a real-valued multiplier is applied using integers only. It can be illustrated as A × 0.05 = A × (0.05 × 2^31) × 2^(−31): the factor 0.05 × 2^31 is rounded to a 32-bit integer, and the multiplication by 2^(−31) is implemented as a bit shift, so the whole product uses only integer arithmetic (see the sketch below).
- ARM NEON: a SIMD (Single Instruction, Multiple Data) extension of the ARM processor architecture that speeds up parallel processing of 8-, 16-, or 32-bit data.
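A minimal sketch of this fixed-point trick in plain Python (on hardware this would be an int32 × int32 → int64 multiply followed by a shift; the rounding details are simplified relative to the paper):

```python
def multiply_by_real(acc: int, multiplier: float = 0.05) -> int:
    """Multiply an integer accumulator by a real-valued constant using
    only integer arithmetic: fixed-point multiply plus a rounding shift."""
    m0 = round(multiplier * (1 << 31))        # 0.05 as a 32-bit fixed-point value
    product = acc * m0                        # int32 * int32 -> int64 on hardware
    return (product + (1 << 30)) >> 31        # round to nearest, then / 2**31

print(multiply_by_real(1000))                 # -> 50, i.e. 1000 * 0.05
```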