Abstract:
In the current post-genomic era, the amount of genomic data is growing rapidly, largely owing to the availability and relative affordability of various sequencing technologies. Robust computational methods, frameworks, and tools that can tackle practical challenges must be developed at the same pace to properly leverage the information extracted from those data. One of the fundamental concepts derived from biological sequences is the set of evolutionary relationships among a group of organisms, which is represented as a tree called a phylogenetic tree. Such trees have crucial biological applications, for instance tracking the evolution of a disease or designing new drugs. Multiple sequence alignment (MSA) is an important early step in the pipeline of inferring a phylogenetic tree, in which the given sequences are arranged according to their evolutionary history. The characteristics and the quality of the obtained MSA dramatically influence the accuracy of the estimated tree.
Usually, MSAs are inferred by optimizing a single function or objective. The alignments estimated under one criterion may differ from the alignments generated under other criteria, inferring discordant homologies and thus leading to different hypothesized evolutionary histories relating the sequences. Therefore, multi-objective (MO) optimization, in which multiple conflicting objective functions are optimized simultaneously to generate a set of alternative alignments, seems appealing and has been considered in the literature. However, no theoretical or empirical justification with respect to a real-life application has been shown for any particular MO formulation. In this thesis, for the first time, we systematically study the question (Q-A) of whether an application-aware (in this case phylogeny-aware) metric can guide us in choosing appropriate MO formulations that result in better phylogeny estimation. Employing MO metaheuristics, we demonstrate that (a) trees estimated on the alignments generated by MO formulations are substantially better than the trees estimated on the alignments generated by the state-of-the-art MSA tools and (b) highly accurate alignments with respect to popular measures do not necessarily lead to highly accurate phylogenetic trees.
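To make concrete what "a set of alternative alignments" means under conflicting objectives, the following minimal sketch filters candidates down to the non-dominated (Pareto-optimal) ones. It is an illustration only, not the formulation used in the thesis: the alignment names, the two objectives, and their scores are hypothetical, and both objectives are assumed to be maximized.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """Return True if score vector a Pareto-dominates b (maximization).

    a dominates b when a is at least as good in every objective
    and strictly better in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidate alignments scored on two made-up objectives
# (objective_1, objective_2); the values are invented for illustration.
candidates = {
    "aln_A": (0.90, 0.40),
    "aln_B": (0.70, 0.80),
    "aln_C": (0.60, 0.30),  # worse than aln_A and aln_B in both objectives
    "aln_D": (0.50, 0.90),
}

front = pareto_front(candidates)
print(front)  # ['aln_A', 'aln_B', 'aln_D']
```

None of the three surviving alignments is better than another in both objectives at once, which is why an MO approach returns all of them rather than a single winner.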
PASTA (Practical Alignments using SATé and TrAnsitivity) is a state-of-the-art method for computing MSAs, well known for its accuracy and scalability. It iteratively co-estimates both the MSA and the maximum likelihood (ML) phylogenetic tree. Currently, PASTA uses the ML score as its sole optimization criterion. We strengthen our study of the question Q-A by integrating multiple application-aware objectives into PASTA and examining their profound positive impact on it. In particular, we employ the four application-aware objectives identified earlier, alongside the ML score, to develop an MO framework, namely PMAO, that leverages PASTA to generate a set of high-quality solutions that are considered equivalent in the context of the conflicting objectives under consideration. Furthermore, PMAO leverages the power of machine learning to aid domain experts in choosing the most appropriate tree from the PMAO output (which contains a relatively large set of high-quality solutions). This aspect of PMAO could be of independent interest in the context of MO-based approaches in other domains as well.
MUSCLE is a general-purpose MSA tool widely used for its high throughput and accuracy. We pursue our original question (Q-A) further and carefully equip MUSCLE with multiple application-aware objectives to enhance its capability to yield better trees. This thesis thus introduces MAMMLE, a framework for inferring better phylogenetic trees from unaligned sequences by hybridizing MUSCLE with a multi-objective optimization strategy and leveraging multiple ML hypotheses. MAMMLE is an end-to-end approach for phylogeny estimation from unaligned sequences as well as a flexible framework whose components can potentially be modified, replaced, or further refined by bioinformatics researchers and practitioners.
Finally, we shift our focus to species tree estimation from multi-locus genome-wide data. This is a complicated problem, and accurate estimation faces many challenges, such as the limited number of available gene trees and the presence of gene tree estimation error. Consequently, even statistically consistent phylogenomic methods may fail to reconstruct highly accurate trees under practical model conditions. With the goal of applying application-awareness in the context of MO optimization, in this thesis we present an MO metaheuristic algorithm (SNOGA), a modified version of the popular NSGA-II, which combines various optimization criteria to find a suitable search space containing highly accurate species trees.
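For readers unfamiliar with NSGA-II, its core ranking step, non-dominated sorting, can be sketched as follows. This is a minimal illustration of the standard procedure and does not reflect the modifications that distinguish SNOGA, which are not described here. The score vectors are hypothetical and all objectives are assumed to be maximized.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and better somewhere (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(points: Sequence[Sequence[float]]) -> List[List[int]]:
    """Partition point indices into ranked fronts; front 0 is the Pareto front."""
    n = len(points)
    beaten = [[] for _ in range(n)]  # beaten[i]: indices of points that i dominates
    dom_count = [0] * n              # dom_count[i]: number of points dominating i
    fronts: List[List[int]] = [[]]

    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                beaten[i].append(j)
            elif dominates(points[j], points[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)

    k = 0
    while fronts[k]:
        next_front: List[int] = []
        for i in fronts[k]:
            for j in beaten[i]:
                # Removing front k leaves j with one fewer dominator.
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
        k += 1
    return fronts[:-1]  # drop the trailing empty front


# Hypothetical two-objective scores for five candidate trees.
scores = [(3, 3), (1, 4), (4, 1), (2, 2), (1, 1)]
ranked = non_dominated_sort(scores)
print(ranked)  # [[0, 1, 2], [3], [4]]
```

In NSGA-II, these ranks drive selection: candidates in earlier fronts are preferred when forming the next generation.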