Abstract:
The number of known protein sequences has grown exponentially in recent years, owing to the rapid development of sequencing technologies. However, biologists cannot keep pace in characterizing the attributes of newly discovered protein sequences, because laboratory experiments are tedious and expensive. Computational methods for predicting protein attributes are therefore in high demand. One of the principal tasks of this thesis is to pursue sequence-based computational methods for several protein attribute prediction problems: Golgi Apparatus (GA) resident protein type prediction, DNA-binding protein (DNA-BP) prediction and protective antigen prediction. By solving these problems with a sequence-based methodology, our research provides empirical support for the natural belief that a protein’s functional and structural information is intrinsically encoded within its primary sequence.
Given a GA protein, an important research question is whether it is a cis-Golgi or a trans-Golgi protein, because correct classification of GA proteins can aid drug development against various congenital, neurodegenerative and inherited diseases. We propose a sequence-based prediction model for sub-Golgi protein types. A DNA-BP binds to DNA to regulate and affect various cellular processes. As such, DNA-BPs can potentially be used for drug development in treating genetic diseases and cancers. We develop a DNA-BP predictor that extracts meaningful information directly from the protein sequences, without any dependence on functional domain or structural information. Recursive Feature Elimination (RFE) is then applied to optimize the number of features used in the prediction process. Another important protein attribute prediction problem that we tackle is whether a given pathogenic protein can invoke an adaptive immune response upon subsequent exposure to the specific pathogen or related organisms. Such proteins are called protective antigens and are of immense importance in vaccine preparation and drug design. We propose a protective antigen predictor that, again, exploits only sequence-based features to provide a pathogen-independent prediction model. Our predictor can be used to quickly sift through any pathogen proteome and produce a list of potential protective antigens.
Through the exercise of building these three predictors, we formulate a general framework for feature extraction and selection that can be applied to any protein attribute prediction problem. A distinctive characteristic of this framework is that it exploits only features derived from the proteins’ primary sequences, leaving out any structural, evolutionary or functional features, which keeps the whole framework lightweight. The framework counts small substrings, with or without gaps, in a protein sequence to represent the protein in a discrete model, and then applies a novel feature selection approach.
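The substring-counting step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the choice of contiguous substrings up to length 2, residue pairs with up to 2 gap positions, the normalization by window count, and the `-` placeholder used in gapped feature names are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count contiguous substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_pair_counts(seq, gap):
    """Count residue pairs separated by exactly `gap` intervening positions.

    A pair is named like "A-C" (gap=1) or "A--C" (gap=2); the dashes are a
    hypothetical naming convention, not part of the sequence alphabet.
    """
    return Counter(
        seq[i] + "-" * gap + seq[i + gap + 1]
        for i in range(len(seq) - gap - 1)
    )


def feature_vector(seq, max_k=2, max_gap=2):
    """Build a sparse, length-normalized feature dictionary for one protein.

    max_k and max_gap are illustrative defaults (assumptions).
    """
    features = {}
    for k in range(1, max_k + 1):
        windows = max(len(seq) - k + 1, 1)
        for name, count in kmer_counts(seq, k).items():
            features[name] = count / windows
    for gap in range(1, max_gap + 1):
        windows = max(len(seq) - gap - 1, 1)
        for name, count in gapped_pair_counts(seq, gap).items():
            features[name] = count / windows
    return features


if __name__ == "__main__":
    # Toy sequence, not real data.
    for name, value in sorted(feature_vector("ACACA").items()):
        print(name, round(value, 3))
```

Storing the features as a dictionary keeps the representation sparse, which matters because the number of possible substrings grows quickly with substring length. A feature selection step would then operate on these named features.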
Another focus of this thesis is phylogeny, the study of the evolutionary relationships among different species, genes or proteins (taxa). When gene copies are sampled from various species, the gene tree relating these copies might disagree with the species phylogeny. This discord can arise from horizontal gene transfer, incomplete lineage sorting (ILS), and gene duplication and extinction. Summary methods of species tree estimation work by first estimating the individual gene trees from their respective gene sequence alignments, and then summarizing these gene trees to reconstruct the species phylogeny. To speed up the gene tree estimation step, we propose a set of distance measures between two biological sequences that utilize the concepts of minimal and relative absent words. These distance measures are computed in an alignment-free manner. We demonstrate the use of these techniques experimentally and show how the resulting pairwise distance matrix can be used to reconstruct the gene phylogeny. When gene tree discordance is modeled by ILS, coalescent-based methods need to be applied to estimate the species tree accurately. One such method is Quartet FM (QFM), which is highly accurate but does not scale to large numbers of taxa. We propose boosting the scalability and performance of QFM through the application of disk covering methods (DCMs). Extensive experimentation on large simulated datasets demonstrates the superiority of our method over ASTRAL, a widely used and highly accurate coalescent-based species tree estimation method that is statistically consistent under the multi-species coalescent model.
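To make the alignment-free distance idea concrete, the sketch below computes minimal absent words (words that do not occur in a sequence although all of their proper factors do) by brute force up to a fixed length, and compares two sequences by the Jaccard distance between their minimal absent word sets. The Jaccard distance is a stand-in chosen for illustration and is not necessarily one of the measures proposed in the thesis; the function names and the length bound `max_len` are likewise assumptions, and relative absent words are not covered.

```python
def factors(seq, max_len):
    """All substrings of seq with length between 1 and max_len."""
    return {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }


def minimal_absent_words(seq, alphabet, max_len):
    """Brute-force minimal absent words of seq, up to length max_len.

    A word w is a minimal absent word if w does not occur in seq while
    both w[:-1] and w[1:] do (single letters only need to be absent).
    """
    present = factors(seq, max_len)
    maws = {a for a in alphabet if a not in present}
    for u in present:
        if len(u) >= max_len:
            continue
        for a in alphabet:
            w = u + a  # w[:-1] == u occurs in seq by construction
            if w not in present and w[1:] in present:
                maws.add(w)
    return maws


def maw_jaccard_distance(x, y, alphabet, max_len=4):
    """Jaccard distance between MAW sets (illustrative choice of measure)."""
    mx = minimal_absent_words(x, alphabet, max_len)
    my = minimal_absent_words(y, alphabet, max_len)
    union = mx | my
    if not union:
        return 0.0
    return 1.0 - len(mx & my) / len(union)


def distance_matrix(seqs, alphabet, max_len=4):
    """Symmetric pairwise distance matrix as a list of lists."""
    return [
        [maw_jaccard_distance(a, b, alphabet, max_len) for b in seqs]
        for a in seqs
    ]


if __name__ == "__main__":
    # Toy DNA sequences, not real data.
    toy = ["ACGTACGT", "ACGTTCGT", "TTTTGGGG"]
    for row in distance_matrix(toy, "ACGT"):
        print([round(d, 3) for d in row])
```

A matrix of this kind could be passed to any distance-based tree reconstruction method. The brute-force enumeration here is only practical for short sequences and small `max_len`.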
Overall, this thesis offers a generic framework for tackling protein attribute prediction problems using information solely from the protein sequence and attempts to scale existing phylogeny estimation methods to larger datasets.