Abstract:
The best-performing supervised learning models are often ensembles of many base classifiers or a single very large and complex classifier, but such models are difficult to deploy on resource-constrained smartphones and Internet of Things (IoT) devices. Model compression, or distillation, addresses this problem by turning a large and complex model, or an ensemble of models, into a smaller and faster model better suited to such devices, usually without a significant loss in performance. However, existing offline distillation methods rely on a strong pre-trained teacher model to solve complex problems, which leads to a lengthy and complex multi-phase training procedure. Online counterparts address this limitation by training the student and teacher models simultaneously, with peer learning providing additional teaching knowledge. Although online distillation sometimes outperforms teacher-based offline distillation, this simultaneous teacher-student learning strategy can degenerate into a "the blind leading the blind" paradigm. To avoid these problems, we present a new single-stage training procedure named Mixture of Distillation (MoD), which introduces a distinct independent-dependent group learning scheme for both student and teacher models and exploits the complementary strengths of the offline and online distillation loss functions. The main objective of this hybrid approach is to improve accuracy and reduce training time. Extensive evaluations on the SVHN, MNIST, NumtaDB, CIFAR-10, and CIFAR-100 datasets substantiate that the proposed Mixture of Distillation improves generalization performance more significantly than existing distillation methods.
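To make the combination of loss terms concrete, the sketch below illustrates one plausible way to mix an offline term (matching a pre-trained teacher) with an online term (matching a simultaneously trained peer) alongside the supervised loss. It is a minimal PyTorch example under our own assumptions; the function name, temperature, and weighting coefficients are illustrative and do not reproduce the paper's actual MoD formulation.

```python
import torch
import torch.nn.functional as F

def mixed_distillation_loss(student_logits, teacher_logits, peer_logits,
                            targets, T=4.0, alpha=0.3, beta=0.3):
    """Hypothetical mixture of offline and online distillation losses.

    T, alpha, and beta are illustrative hyper-parameters, not the paper's.
    """
    # Standard supervised cross-entropy on ground-truth labels
    ce = F.cross_entropy(student_logits, targets)

    # Offline term: match the softened outputs of a pre-trained teacher
    offline = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Online term: match the softened outputs of a peer trained in parallel
    online = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(peer_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    return (1 - alpha - beta) * ce + alpha * offline + beta * online
```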