Abstract:
Proteins are the ‘doers’ of all living organisms. The subcellular localization of human proteins plays an important role in inferring their structures and functions in our cells. With recent advances in molecular imaging techniques, analyzing image data to determine protein subcellular locations is more important than ever, and image-based analysis is becoming a widely used alternative to conventional 1D amino acid sequence data. Automating the classification of human protein localization can accelerate biomedical research tasks and the diagnosis of diseases, reducing both time and manual effort.
Although using deep convolutional neural networks (DCNNs) to classify images is a straightforward approach, our task comes with multiple challenges. First, it is a multi-label problem: there are 28 distinct labels, and a single image can be assigned several of them. Second, there is a strong class imbalance in the dataset, with some labels appearing in less than 0.3% of the data. Lastly, the protein location classification task must be performed across a wide range of different human cells. We aim to overcome these challenges through different approaches.
In this work, our principal goal is to present an end-to-end system for the classification of mixed-pattern protein subcellular localization from confocal microscopy images, using convolutional neural networks. We report the outcomes of several experimental setups on a highly imbalanced dataset and investigate their effectiveness. We also demonstrate that oversampling outperforms cost-sensitive learning in handling the data imbalance problem. In addition, we show that an ensemble of models consistently benefits our task. Using these observations, we achieved a public macro F1 score of 0.574 and a private macro F1 score of 0.515 on the dataset of the Kaggle competition Human Protein Atlas Image Classification.
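Since the reported results are macro F1 scores on a multi-label task, the following minimal sketch illustrates how such a score can be computed: F1 is evaluated separately for each label and the per-label values are averaged with equal weight, so rare labels count as much as frequent ones. The function name `macro_f1`, the toy data with 3 labels (instead of 28), and the convention of scoring a label with no true or predicted positives as 0 are assumptions made for illustration, not details taken from this work or from the competition's evaluation code.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro F1 for multi-hot label matrices (rows = images, columns = labels)."""
    n_labels = len(y_true[0])
    per_label = []
    for k in range(n_labels):
        # Count true positives, false positives and false negatives for label k.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the label never occurs and is never predicted.
        per_label.append(2 * tp / denom if denom else 0.0)
    # Unweighted mean over labels: each label contributes equally.
    return sum(per_label) / n_labels


# Toy example: 4 images, 3 labels; the third label is rare and is missed entirely.
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

In this toy case the per-label F1 values are 1, 2/3 and 0, so the single missed rare label pulls the macro average down to 5/9 even though most individual predictions are correct. This is why class imbalance weighs heavily on this metric.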