On the diffusion approximation of nonconvex stochastic gradient descent

Hu, Wenqing; Li, Chris Junchi; Li, Lei; Liu, Jian-Guo

doi:10.4310/AMSA.2019.v4.n1.a1

Annals of Mathematical Sciences and Applications

Volume 4 (2019)

Number 1

Pages: 3 – 32

DOI: https://dx.doi.org/10.4310/AMSA.2019.v4.n1.a1

Authors

Wenqing Hu (Department of Mathematics and Statistics, Missouri University of Science and Technology, Rolla, Missouri, U.S.A.)

Chris Junchi Li (Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey, U.S.A.)

Lei Li (Department of Mathematics, Duke University, Durham, North Carolina, U.S.A.)

Jian-Guo Liu (Department of Physics and Department of Mathematics, Duke University, Durham, North Carolina, U.S.A.)

Abstract

We study the stochastic gradient descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. Using the weak form of the master equation for probability evolution, we prove rigorously that the diffusion process approximates the SGD algorithm weakly. In the small step size regime and in the presence of omnidirectional noise, our weak approximating diffusion process suggests the following dynamics for the SGD iteration starting from a local minimizer (resp. saddle point): it escapes in a number of iterations that depends exponentially (resp. almost linearly) on the inverse step size. The results are obtained using the theory of random perturbations of dynamical systems (large deviation theory for local minimizers and exit theory for unstable stationary points). In addition, we discuss the effects of batch size for deep neural networks, and we find that a small batch size helps SGD algorithms escape unstable stationary points and sharp minimizers. Our theory indicates that using a small batch size at an earlier stage and increasing the batch size at a later stage helps SGD become trapped in flat minimizers for better generalization.
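
As a rough illustration of what weak approximation means here, the sketch below compares SGD with a diffusion on a one-dimensional double-well objective. Everything in it is an assumption made for illustration and is not taken from the paper: the objective f(x) = (x^2 - 1)^2 / 4, the additive Gaussian gradient noise with constant scale sigma, and the commonly used form dX = -f'(X) dt + sqrt(eta) * sigma dW for the approximating diffusion, with one SGD iteration corresponding to time eta. The function names and parameter values are hypothetical.

```python
import numpy as np

def grad_f(x):
    # Gradient of the assumed double-well objective f(x) = (x**2 - 1)**2 / 4,
    # which has minimizers at x = -1 and x = 1 and an unstable point at x = 0.
    return x * (x**2 - 1.0)

def sgd_final(x0, eta, sigma, n_steps, n_samples, rng):
    # Run n_samples independent SGD chains with additive Gaussian gradient noise:
    # x <- x - eta * (grad_f(x) + sigma * xi), xi ~ N(0, 1).
    x = np.full(n_samples, x0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(n_samples)
        x = x - eta * (grad_f(x) + sigma * xi)
    return x

def diffusion_final(x0, eta, sigma, n_steps, n_samples, rng, substeps=4):
    # Euler-Maruyama discretization of the assumed diffusion
    # dX = -grad_f(X) dt + sqrt(eta) * sigma dW, run up to time n_steps * eta.
    dt = eta / substeps
    x = np.full(n_samples, x0, dtype=float)
    for _ in range(n_steps * substeps):
        dw = np.sqrt(dt) * rng.standard_normal(n_samples)
        x = x - grad_f(x) * dt + np.sqrt(eta) * sigma * dw
    return x

rng = np.random.default_rng(0)
eta, sigma = 0.05, 1.0  # illustrative step size and noise scale

# Start both processes at the local minimizer x = 1.
x_sgd = sgd_final(1.0, eta, sigma, 200, 2000, rng)
x_sde = diffusion_final(1.0, eta, sigma, 200, 2000, rng)

# Weak approximation concerns expectations of test functions, here g(x) = x**2.
print("SGD       E[x^2] ~", np.mean(x_sgd**2))
print("diffusion E[x^2] ~", np.mean(x_sde**2))
```

The two sample averages should agree up to Monte Carlo error and a small discretization bias. Agreement of expectations, rather than of individual trajectories, is the sense in which the diffusion approximates SGD in this sketch.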

Keywords

nonconvex optimization, stochastic gradient descent, diffusion approximation, stationary points, batch size

C. J. Li is partially supported by RNMS11-07444 (KI-Net) during his visit at Duke University.

The work of J.-G. Liu is partially supported by KI-Net NSF RNMS11-07444, NSF DMS-1514826 and NSF DMS-1812573.

Received 25 May 2018

Published 26 February 2019