Abstract:
A protein is a large, complex macromolecule that plays many crucial roles in the human body, performing most of the work in cells and tissues. It consists of one or more long chains of amino acids. Carbohydrates are another important class of biomolecules, after DNA and proteins. Carbohydrates interact with proteins to facilitate various biological processes. Several biochemical experiments exist to study protein-carbohydrate interactions, but they are expensive, time-consuming, and challenging. As a result of the swift advancements in sequencing technologies, the quantity of recognized protein sequences has surged exponentially. Therefore, developing a computational technique that effectively predicts protein-carbohydrate binding interactions from known protein sequences has emerged as a prominent new area of study.
Most of the computational approaches for protein-carbohydrate binding sites prediction are biased towards the negative class. This is due to the fact that the count of carbohydrate-binding residues is considerably lower compared to non-carbohydrate-binding residues in the benchmark datasets. In this thesis, we introduce a proficient ensemble machine learning model called ‘StackCBEmbed’ for the accurate classification of protein-carbohydrate binding interactions at the residue level within established protein sequences. StackCBEmbed demonstrates a more balanced behavior compared to the state-of-the-art methods in terms of accurately predicting both the positive and negative data points.
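The effect of class imbalance described above can be illustrated with a minimal sketch. The labels below are hypothetical (5 binding residues out of 100) and are not taken from the benchmark datasets; the helper names `confusion_counts` and `rates` are introduced here purely for illustration. A predictor that always outputs the negative class scores a high overall accuracy while detecting no binding residue at all.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def rates(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on binding residues
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-binding residues
    return accuracy, sensitivity, specificity


# Hypothetical imbalanced residue labels: 5 binding, 95 non-binding.
y_true = [1] * 5 + [0] * 95
# A degenerate predictor that is fully biased towards the negative class.
always_negative = [0] * len(y_true)

acc, sens, spec = rates(y_true, always_negative)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

This is why sensitivity and specificity are reported separately rather than relying on overall accuracy alone.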
Our research used a benchmark training dataset and two separate independent test sets. Through the use of the Incremental Feature Selection method, we identified crucial sequence-based features and picked the most impactful ones. Furthermore, we integrated embedding characteristics from a pre-trained transformer-based language model known as ‘ProtT5-XL-Uniref50.’ To the best of our knowledge, this is the initial endeavor to utilize a protein language model for predicting protein-carbohydrate binding interactions. StackCBEmbed achieved sensitivity, specificity, and balanced
accuracy scores of 0.691, 0.849, 0.769 and 0.627, 0.835, 0.731 in the two independent test sets respectively. Compared to the earlier prediction models that were benchmarked on the same datasets, our reported results are significantly superior. Thus, we hope that StackCBEmbed will help discover novel protein-carbohydrate interactions and advance the related research fields. StackCBEmbed is freely available as Python scripts at https://github.com/farah5112github/StackCBEmbed
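As a quick consistency check on the figures above, the sketch below recomputes balanced accuracy from the reported sensitivity and specificity. It assumes the standard definition of balanced accuracy as the mean of sensitivity and specificity. The dictionary keys and the tolerance are illustrative choices, and the tolerance allows for the reported values being rounded to three decimals.

```python
def balanced_accuracy(sensitivity, specificity):
    """Standard definition: the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# (sensitivity, specificity, reported balanced accuracy) as stated in the abstract.
reported = {
    "independent test set 1": (0.691, 0.849, 0.769),
    "independent test set 2": (0.627, 0.835, 0.731),
}

# Illustrative tolerance: inputs and output are each rounded to three decimals.
TOLERANCE = 0.0015

recomputed = {}
for name, (sens, spec, reported_bacc) in reported.items():
    recomputed[name] = balanced_accuracy(sens, spec)
    consistent = abs(recomputed[name] - reported_bacc) < TOLERANCE
    print(f"{name}: recomputed={recomputed[name]:.3f} "
          f"reported={reported_bacc:.3f} consistent={consistent}")
```

Both recomputed values agree with the reported balanced accuracies to within rounding.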