Abstract:
Estimating a species tree from biomolecular sequences is extremely difficult, especially in the presence of gene tree heterogeneity resulting from incomplete lineage sorting (ILS). Two of the most popular techniques for estimating species trees are combined analysis (CA), which concatenates the multiple sequence alignments of different genes into a single supergene alignment and then estimates a tree from that alignment, and summary methods, which estimate gene trees from individual loci and then combine the inferred gene trees into a species tree. CA can be highly accurate in many cases because the concatenated alignment carries a strong phylogenetic signal. However, it is agnostic to gene tree discordance (i.e., different genes having different evolutionary histories) and can therefore be statistically inconsistent. Summary methods, on the other hand, explicitly account for gene tree discordance and its underlying biological causes, and thus can be statistically consistent. However, they do not perform well when the number of genes is limited and the gene trees are poorly estimated (i.e., gene tree estimation errors are prevalent). In this study, we introduce a hybrid pipeline for species tree estimation that combines the strengths of CA and summary methods. Specifically, we modify the workflow of SVDquartets, a widely used quartet-based summary method, by combining it with an existing technique called “binning” and with wQFM, a highly accurate quartet amalgamation technique. We assessed the performance of the proposed hybrid pipeline on a collection of simulated and real biological datasets that cover a wide range of challenging model conditions with varying numbers of genes, amounts of gene tree estimation error, and levels of gene tree discordance.
Our extensive evaluation on both simulated and real biological datasets suggests that this hybrid pipeline could be a promising approach for estimating species trees, especially in the presence of gene tree estimation error due to short gene sequences.