The fundamental tool for training deep neural networks (DNNs) is Stochastic Gradient Descent (SGD). In this talk we discuss an algorithmically simple modification of SGD which significantly improves the training time as well as the generalization error for benchmark DNNs. We also discuss a related algorithm which allows for effective training of DNNs in parallel. Mathematically, we make a connection to Stochastic Optimal Control and the related nonlinear PDEs for the value function, the Hamilton-Jacobi-Bellman equations. The PDE interpretation allows us to prove that the algorithm improves the training time. Further connections with PDEs and nonconvex optimization allow us to determine optimal values of the hyper-parameters of the algorithm, which lead to further improvements.
Everyone is welcome!