Abstract:
Protein subcellular localization prediction is the task of predicting where inside the cell a given protein functions. It is considered an important step towards protein function prediction and drug design, and predicting localization from primary protein sequences is crucial for understanding genome regulation and function. Support vector machine (SVM) based learning methods have been shown to be effective for predicting protein subcellular and subnuclear localization, and extracting informative features for use with the SVM plays an important role in designing an accurate prediction system. Proteins are large, complex molecules required for the structure, function, and regulation of the body's tissues and organs. Localizing proteins to different compartments of a cell is a means of achieving functional diversity. Localization determines a protein's access to its interacting partners and enables the integration of proteins into functional biological networks: to reach the appropriate molecular interaction partners, a protein must be in the right place at the right moment. Protein subcellular localization is therefore crucial for protein synthesis and for drug discovery across a broad range of medical conditions and diseases.
This study introduces a novel machine learning approach in bioinformatics for classifying 361 protein sequences by their location inside a cell. The sequences were provided in string (text) format, and a set of features was extracted from them. The feature set comprises 8 physicochemical properties of proteins found in 6 target locations of a cell. Given the well-established use of SVMs in this field, a support vector machine (SVM) based model was developed to learn these properties and was tested on an independent dataset. The algorithm selects an optimal range of SVM parameters and applies feature selection to obtain the best performance. It achieved an average accuracy of 90% in classifying proteins into the target locations, outperforming several similar algorithms presented in the literature. The proposed technique can be further extended to protein sequences found in any part of the body.
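
The workflow summarized above (sequence string, then physicochemical features, then an SVM with parameter and feature selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name its eight properties, so the eight features below (length, mean Kyte-Doolittle hydropathy, and six residue-group fractions) are illustrative stand-ins. The function names `extract_features` and `fit_svm`, the parameter grid, and the use of scikit-learn are likewise assumptions made for this example.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical residue groups; the source does not list its eight properties.
GROUPS = {
    "positive": "KR",
    "negative": "DE",
    "polar": "STNQ",
    "aromatic": "FWY",
    "aliphatic": "AILV",
    "tiny": "GAS",
}


def extract_features(seq):
    """Map a protein sequence string to eight numeric features."""
    # Keep only the 20 standard residues; ambiguous codes such as X are skipped.
    residues = [aa for aa in seq.upper() if aa in KD]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    mean_hydropathy = sum(KD[aa] for aa in residues) / n
    # Fraction of residues that fall in each group, in GROUPS order.
    fractions = [
        sum(aa in members for aa in residues) / n
        for members in GROUPS.values()
    ]
    return [float(n), mean_hydropathy, *fractions]


def fit_svm(sequences, labels, folds=5):
    """Grid-search an RBF SVM with univariate feature selection.

    scikit-learn is imported lazily so that extract_features can be used
    without it. The grid values are assumptions, not the paper's settings.
    """
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = [extract_features(s) for s in sequences]
    pipeline = Pipeline([
        ("scale", StandardScaler()),        # SVMs are sensitive to feature scale
        ("select", SelectKBest(f_classif)),  # keep the k most discriminative features
        ("svm", SVC(kernel="rbf")),
    ])
    param_grid = {
        "select__k": [4, 6, 8],
        "svm__C": [0.1, 1, 10, 100],
        "svm__gamma": [0.001, 0.01, 0.1, 1],
    }
    search = GridSearchCV(pipeline, param_grid, cv=folds, scoring="accuracy")
    return search.fit(X, labels)
```

Under these assumptions, `fit_svm(train_sequences, train_labels)` would return a fitted search object whose `best_params_` reports the selected `C`, `gamma`, and number of features, and whose `score` method could then be applied to feature vectors from a held-out set to estimate accuracy on independent data.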