Abstract:
Succinylation of lysine residues is a special type of post-translational modification (PTM). It plays a crucial role in balancing cellular processes. Abnormal succinylation can cause cancers, metabolic diseases, inflammation and nervous system diseases. Detecting succinylation sites is therefore of great importance for exploring the function of proteins. However, the experimental methods for detecting succinylation sites are costly, time-consuming and labor-intensive. This calls for computational models with high efficacy. Attention has been given in the literature to developing such models, albeit with only moderate success across different evaluation metrics. In particular, the existing works failed to balance two metrics, sensitivity and specificity, leaving considerable room for improvement. One important aspect here is the biochemical and physicochemical properties of amino acids, which appear to be useful as features for such computational predictors. However, some of the existing computational models did not use the biochemical and physicochemical properties of amino acids, while others used them without considering the inter-dependency among the properties.
In this thesis, we revisit the computational prediction of succinylated lysine residues (SLRs) and use a broad spectrum of weaponry to tackle this problem. We first focus on the biochemical and physicochemical properties of amino acids and formulate an optimization problem to find a combination of properties that is better suited to the problem at hand, considering their inter-dependencies and other factors. In particular, we propose a variant of the genetic algorithm, called IBCGA, to search for suitable combinations of these properties for efficient prediction of SLRs. In this context, we leverage the power of Random Forest (RF) and Balanced RF (a variant of RF that handles imbalanced data).
We then propose three deep learning architectures, CNN+Bi-LSTM (CBL), Bi-LSTM+CNN (BLC) and their combination (CBL BLC), thereby leveraging the potential of deep neural network architectures for SLR prediction. We also employ different ensembling techniques to improve the performance of our models, including heterogeneous ensembling of traditional ML models with deep learning architectures. Finally, we apply differential evolution to tune the threshold of the ensemble classifiers, thereby providing biologists and practitioners with a knob to balance sensitivity and specificity.
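To make the threshold-tuning idea concrete, the following is a minimal sketch of how a decision threshold could be tuned with differential evolution. It is not the implementation used in this thesis. The objective (maximizing the smaller of sensitivity and specificity), the toy scores and labels, and the names `sensitivity_specificity` and `tune_threshold` are all assumptions made for illustration.

```python
import random


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity); score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def cost(scores, labels, threshold):
    """Assumed objective: negated min(sensitivity, specificity), to be minimized."""
    return -min(sensitivity_specificity(scores, labels, threshold))


def tune_threshold(scores, labels, pop_size=20, generations=50, f=0.8, seed=0):
    """DE/rand/1 search over a single threshold in [0, 1]."""
    rng = random.Random(seed)
    # Stratified initial population: one candidate per equal-width slice of [0, 1].
    pop = [(i + rng.random()) / pop_size for i in range(pop_size)]
    costs = [cost(scores, labels, t) for t in pop]
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.sample(others, 3)
            # Mutation. With one dimension, binomial crossover always takes
            # the mutant, so the trial vector is the clipped mutant itself.
            trial = min(1.0, max(0.0, pop[a] + f * (pop[b] - pop[c])))
            trial_cost = cost(scores, labels, trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_cost <= costs[i]:
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best]


# Hypothetical ensemble scores: six positives followed by six negatives.
scores = [0.35, 0.55, 0.6, 0.7, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4, 0.45, 0.65]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

best_t = tune_threshold(scores, labels)
print(best_t, sensitivity_specificity(scores, labels, best_t))
```

Replacing the assumed objective with a weighted combination of sensitivity and specificity would turn the weight into the kind of knob described above.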
The combinations of biochemical and physicochemical properties derived through our optimization process achieve better results than the results achieved by the combination of all the properties. In this context, one of the best performing combinations consists of only two properties. As for our deep learning architectures,