Abstract:
The consensus problems in strings is motivated by the requirement of nding commonality of a large number of strings and has a variety of applications in Bioinformatics. This thesis presents important theoretical and algorithmic results determining the complexity class of the consensus string problem and provides a road map for diagnosing unknown genetic diseases that show Allelic Heterogeneity, a case where a normal gene mutates in di erent orders resulting in two different gene sequences causing two di erent genetic diseases.
In this thesis, we rst show the NP-hardness of the consensus string problem under a well known mutation type, namely transposition as the distance metric. Then we propose a polynomial time algorithm for the relaxed version of the problem which determines the existence of a consensus sequence given two input sequences under the inversion and trans-position metric. Our algorithm detects the existence of a common ancestor gene sequence given two input DNA sequences with theoretical worst case time complexity of O(n4) for both the non-overlapping inversion (reversed complement) metric and transposition met-ric. Here n is the common length of the input sequences. However, for both the inversion and transposition metric, practically the average and worst case time complexity have been found to be O(n2) and O(n3) respectively, where the worst case occurs when both input sequences have similarity of around 90%. Similarly, theoretical worst case space complexity is O(n3) for both the inversion and transposition metric, whereas it is O(n2) practically for the inversion metric. Finally, we present a pathway of detecting Allelic Heterogeneity, a challenging genetic disease, using our algorithm.