Title Statistical analysis of weight discretization in deep learning
Translation of Title Giliųjų neuroninių tinklų svorių diskretizavimo statistinė analizė.
Authors Jonkus, Kamilis
Pages 54
Keywords [eng] neural networks ; weight quantization ; integer quantization ; clipping threshold ; statistical distributions
Abstract [eng] The practical application of deep neural networks requires substantial computational resources. These costs can be reduced by approximating model parameters and layer activations with lower-precision numerical formats. However, when the bit width is too low, model accuracy may decrease substantially. This degradation is caused by clipping distortion at the edges of the quantization interval and rounding distortion within the discretization grid. In this thesis, neural network weight quantization was investigated using statistical methods. Four parametric distributions (Gaussian, Laplace, Student’s t, and generalized Gaussian) were fitted to trained convolutional network weights at both layer and channel granularity. For each tensor or channel, the best-fitting distribution was selected. Based on the selected distribution, an MAE-optimal symmetric clipping bound was derived for integer quantization. Compared with the standard MinMax method, the analytical clipping bound reduced the mean absolute quantization error in most evaluated settings. However, lower weight reconstruction error did not consistently lead to higher classification accuracy. On ImageNet, the analytical method achieved lower accuracy than MinMax for ResNet18, even when its MAE was smaller. This indicates that a clipping threshold that is optimal for weight approximation is not necessarily optimal for the final classification task. To explain this discrepancy, the relationship between quantization-induced output mean shifts and accuracy degradation was investigated. Previous studies have shown that weight quantization can shift the means of layer-output distributions and thereby reduce model accuracy. This hypothesis was tested by training 500 small convolutional networks on MNIST. Under 4-bit quantization, a statistically significant positive correlation was found between total output mean shift and accuracy drop for both clipping methods: 0.761 for the analytical method and 0.707 for MinMax.
Bias correction was then applied to compensate for these output mean shifts. The floating-point ResNet18 model achieved 69.86% top-1 accuracy on ImageNet. After bias correction, the accuracy of the MinMax-quantized INT8 model increased from 69.71% to 69.76%, while the accuracy of the analytically quantized INT8 model increased from 67.56% to 68.57%. Thus, bias correction reduced part of the accuracy deficit caused by quantization, especially for the analytical method, but MinMax remained the more accurate clipping method in the final ImageNet evaluation.
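The two mechanisms described in the abstract, choosing a symmetric clipping bound and correcting the output mean shift, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation. It assumes synthetic Laplace-distributed weights, replaces the analytical MAE-optimal bound with a plain grid search, and assumes bias correction takes the common form of subtracting the expected output shift (W_q - W) E[x] from the bias of a linear layer. All function names are hypothetical.

```python
import numpy as np

def quantize_symmetric(w, clip, bits=4):
    """Clip to [-clip, clip], round onto a signed integer grid, dequantize."""
    qmax = 2 ** (bits - 1) - 1              # 7 levels per side for 4 bits
    scale = clip / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def mae(w, clip, bits=4):
    """Mean absolute error between the weights and their quantized version."""
    return float(np.mean(np.abs(w - quantize_symmetric(w, clip, bits))))

def minmax_clip(w):
    """MinMax bound for a symmetric grid: the largest absolute weight."""
    return float(np.max(np.abs(w)))

def search_clip(w, bits=4, steps=200):
    """Assumption: grid search stands in for the analytical MAE-optimal bound."""
    top = minmax_clip(w)
    candidates = np.linspace(top / steps, top, steps)  # includes MinMax itself
    errors = [mae(w, c, bits) for c in candidates]
    return float(candidates[int(np.argmin(errors))])

def bias_correct(w, w_q, b, x_mean):
    """Assumed form: remove the expected output shift (W_q - W) E[x]."""
    return b - (w_q - w) @ x_mean

rng = np.random.default_rng(0)
w = rng.laplace(0.0, 0.05, size=(8, 64))   # synthetic weights, not trained ones
b = np.zeros(8)

c_mm = minmax_clip(w)
c_opt = search_clip(w)
mae_mm, mae_opt = mae(w, c_mm), mae(w, c_opt)

# Calibration inputs with a nonzero mean, loosely mimicking post-ReLU activations.
x = np.abs(rng.normal(size=(1000, 64)))
w_q = quantize_symmetric(w, c_opt)
shift_before = np.mean(x @ w_q.T - x @ w.T, axis=0)
b_corr = bias_correct(w, w_q, b, x.mean(axis=0))
shift_after = np.mean((x @ w_q.T + b_corr) - (x @ w.T + b), axis=0)

print("MinMax clip:", c_mm, "MAE:", mae_mm)
print("Searched clip:", c_opt, "MAE:", mae_opt)
print("Max |mean shift| before:", np.abs(shift_before).max())
print("Max |mean shift| after:", np.abs(shift_after).max())
```

Because the MinMax bound is one of the search candidates, the searched bound can never have a higher weight MAE than MinMax in this sketch. As the abstract notes, that says nothing about which bound gives better classification accuracy.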
Dissertation Institution Kauno technologijos universitetas.
Type Master thesis
Language English
Publication date 2026