Mehdi Rezagholizadeh – Huawei Noah’s Ark, Canada
Knowledge distillation (KD) was originally proposed for neural model compression, where it became very prominent, and it later showed its potential for improving the accuracy of neural models as well. This talk introduces knowledge distillation and addresses how and why it helps in compressing and training neural networks. Despite the great success of KD, in this presentation we revisit it from three perspectives: data, model, and training. From the data point of view, we deploy a MiniMax approach to find the regions of the input space where the teacher and student networks diverge the most, and we generate augmented data from the training samples to cover these maximum-divergence regions. The augmented samples enrich the training data and improve KD training. From the model point of view, the original KD technique uses only the last layer of the teacher and student networks, matching their outputs. However, the literature shows that also matching the internal representations or other internal statistics of the two networks can lead to better performance in some architectures, such as transformer-based models. This observation opens a new line of research into the best way of matching the internal representations of two networks. From the training point of view, arguments based on VC dimension theory indicate that KD performs poorly when the capacity gap between the teacher and student networks becomes large. This problem is becoming more serious in NLP, given the ever-growing size of pre-trained models. We propose a solution based on an annealing technique in which the student is exposed to a smoothed version of the teacher output in the early stages of training, and this smoothness is gradually reduced during training using a temperature factor.
This talk includes application insights, theoretical and empirical evidence, and practical experiments that support the effectiveness of our proposed methods.
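
As a rough illustration of the annealing idea described in the abstract, the sketch below softens the teacher output with a temperature and lowers that temperature over training. It is not the method presented in the talk: the linear schedule, the start and end temperatures, the choice to soften the student with the same temperature (as in standard temperature-scaled KD), and the function names `softmax`, `annealed_temperature`, and `kd_loss` are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def annealed_temperature(step, total_steps, t_start=8.0, t_end=1.0):
    """Linearly decay the temperature from t_start to t_end (assumed schedule)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) between temperature-softened distributions.

    Assumption: both networks are softened with the same temperature.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Toy logits for a single 3-class example (made-up values).
teacher = [4.0, 1.0, -2.0]
student = [2.5, 0.5, -1.0]
for step in range(5):
    t = annealed_temperature(step, total_steps=5)
    print(step, round(t, 2), round(kd_loss(student, teacher, t), 4))
```

Early steps use a high temperature, so the student sees a nearly flat teacher distribution; by the last step the temperature reaches 1 and the student is matched against the unsmoothed teacher output.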