
G-2023-36

Partially-separable loss to parallelize partitioned neural network training



Historically, the training of deep artificial neural networks has relied on parallel computing to achieve practical effectiveness. However, as neural networks grow in size, they may no longer fit in the memory of a single computational unit. To address this issue, researchers are exploring techniques to distribute the training process across powerful computational grids or less capable edge devices. In computer vision, multiclass classification neural networks commonly use loss functions that depend non-linearly on all class raw scores, making it impossible to compute the partial derivatives of weight subsets independently during training. In this work, we propose a novel approach for distributing neural network training computations using a master(s)-workers setup and a partially-separable loss function, i.e., a sum of element loss functions. Each element loss depends only on a specific subset of variables, corresponding to a subpart of the neural network, whose derivatives can be computed independently. This makes it possible to distribute every element loss and its corresponding neural network subpart across multiple workers, coordinated by one or several masters. The master(s) then aggregate the worker contributions and perform the optimization procedure before updating the workers. To ensure that each element loss is parameterized by a small fraction of the neural network's weights, the architecture must be adapted, which is why we propose separable layers. Numerical results show the viability of partitioned neural networks with a partially-separable loss function using state-of-the-art optimizers. Finally, we discuss the flexibility of a partitioned neural network architecture and how other deep learning techniques may apply to it. In particular, in a federated learning context, it can preserve worker privacy, since each worker possesses only a fragment of the network, and reduce communication.
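As a rough illustration of the partial separability described above (the selection operators U_i are notation introduced here, not taken from the report): each element loss L_i sees only the weight subset picked out by U_i, so the total loss and its gradient decompose as

L(w) = \sum_{i=1}^{N} L_i(U_i w), \qquad \nabla L(w) = \sum_{i=1}^{N} U_i^{\top} \nabla L_i(U_i w),

which is what allows each worker to evaluate its element loss and gradient independently. A minimal, hypothetical Python sketch of the resulting master-workers step is given below; all function and variable names (element_loss_and_grad, master_step, subsets) are illustrative assumptions, not the authors' code, and the element losses are dummies.

import numpy as np

# Hypothetical sketch (not the report's implementation): a master aggregates
# the gradients of independent element losses, each of which depends only on
# a subset of the weights and could therefore run on a separate worker.

def element_loss_and_grad(w_i, rng):
    # Stand-in for a worker-side computation of one element loss f_i and its
    # gradient with respect to the worker's own weight subset only.
    target = rng.standard_normal(w_i.shape)
    residual = w_i - target
    return 0.5 * float(residual @ residual), residual

def master_step(weights, subsets, lr=0.1, seed=0):
    # `subsets` lists, for each worker, the indices of the weights its element
    # loss uses. In a real deployment each loop iteration would run on a
    # different worker; here they run sequentially for illustration.
    rng = np.random.default_rng(seed)
    total_loss, grad = 0.0, np.zeros_like(weights)
    for idx in subsets:
        loss_i, grad_i = element_loss_and_grad(weights[idx], rng)
        total_loss += loss_i          # L(w) = sum_i L_i(U_i w)
        grad[idx] += grad_i           # overlapping subsets simply add up
    weights -= lr * grad              # master update, then broadcast to workers
    return total_loss, weights

# Example: 6 weights split across 3 (overlapping) element losses.
w = np.zeros(6)
subsets = [np.array([0, 1, 2]), np.array([2, 3, 4]), np.array([4, 5])]
loss, w = master_step(w, subsets)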

11 pages


Document

G2336.pdf (470 KB)