TY - JOUR
T1 - Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
AU - Wang, Bao
AU - Nguyen, Tan
AU - Sun, Tao
AU - Bertozzi, Andrea L.
AU - Baraniuk, Richard G.
AU - Osher, Stanley J.
N1 - Funding Information:
∗Received by the editors October 15, 2021; accepted for publication (in revised form) January 5, 2022; published electronically May 31, 2022. https://doi.org/10.1137/21M1453311 Funding: The work of the authors was supported by National Science Foundation grants DMS-1924935, DMS-1952339, CCF-1911094, IIS-1838177, and IIS-1730574, DOE grant DE-SC0021142, ONR grants N00014-18-12571, N00014-17-1-2551, and N00014-18-1-2047, AFOSR grant FA9550-18-1-0478, DARPA grant G001534-7500, a Vannevar Bush Faculty Fellowship, NSF grant 2030859 to the Computing Research Association for the CIFellows Project, the NSF Graduate Research Fellowship Program, and NSF IGERT Training Grant DGE-1250104. †Department of Mathematics and Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112 USA ([email protected]). ‡Department of Mathematics, UCLA, Los Angeles, CA 90095 USA ([email protected], [email protected], [email protected]). §College of Computer, NUDT, China, 999078 ([email protected]). ¶Department of ECE, Rice University, Houston, TX 77005 USA ([email protected]).
Publisher Copyright:
© by SIAM. Unauthorized reproduction of this article is prohibited.
PY - 2022
Y1 - 2022
AB - Stochastic gradient descent (SGD) algorithms, with constant momentum and its variants such as Adam, are the optimization methods of choice for training deep neural networks (DNNs). There is great interest in speeding up the convergence of these methods due to their high computational expense. Nesterov accelerated gradient with a time-varying momentum (NAG) improves the convergence rate of gradient descent for convex optimization using a specially designed momentum; however, it accumulates error when the stochastic gradient is used, slowing convergence at best and diverging at worst. In this paper, we propose scheduled restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD by the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance, in training ResNet-200 for ImageNet classification, SRSGD achieves an error rate of 20.93% versus the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline. Our implementation of SRSGD is available at https://github.com/minhtannguyen/SRSGD.
KW - Nesterov accelerated gradient
KW - deep learning
KW - restart
KW - stochastic optimization
UR - http://www.scopus.com/inward/record.url?scp=85134662432&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134662432&partnerID=8YFLogxK
U2 - 10.1137/21M1453311
DO - 10.1137/21M1453311
M3 - Article
AN - SCOPUS:85134662432
SN - 1936-4954
VL - 15
SP - 738
EP - 761
JO - SIAM Journal on Imaging Sciences
JF - SIAM Journal on Imaging Sciences
IS - 2
ER -
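
Note (not part of the RIS record): the abstract describes the core update, NAG's increasing momentum applied to stochastic gradients, with the momentum reset to zero on a fixed schedule. The short Python sketch below is illustrative only; the function name srsgd_sketch, the learning rate, the restart frequency of 40 iterations, and the k/(k+3) momentum weight are assumptions made for this toy example, not the paper's settings. The authors' implementation is at https://github.com/minhtannguyen/SRSGD.

# A minimal sketch of scheduled-restart momentum, based only on the abstract's
# description: NAG-style increasing momentum, reset to zero every `restart_every`
# iterations. Hyperparameters and the momentum rule are illustrative assumptions.
import numpy as np

def srsgd_sketch(stochastic_grad, x0, lr=0.1, restart_every=40, n_iters=200):
    """NAG-style updates on a stochastic gradient oracle with scheduled restarts."""
    x = np.asarray(x0, dtype=float)
    v_prev = x.copy()              # previous lookahead iterate
    k = 0                          # iteration counter; reset at each restart
    for _ in range(n_iters):
        g = stochastic_grad(x)                 # noisy gradient at the current point
        v = x - lr * g                         # gradient step
        momentum = k / (k + 3.0)               # increasing NAG-style momentum weight
        x = v + momentum * (v - v_prev)        # extrapolation (momentum) step
        v_prev = v
        k += 1
        if k == restart_every:                 # scheduled restart: momentum back to zero
            k = 0
    return x

# Toy usage: minimize a noisy quadratic f(x) = 0.5 * ||x||^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy_grad = lambda x: x + 0.01 * rng.standard_normal(x.shape)
    x_final = srsgd_sketch(noisy_grad, x0=np.ones(10))
    print(np.linalg.norm(x_final))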