Abstract:
In this thesis we propose an advanced DNA sequence compression algorithm, which is efficient
and lossless. None of the available general-purpose compression algorithm compresses these
sequences. This is due to the specificity of the genetic sequences. It is believed that DNA
sequence encodes life in a nonrandom way. Some important sections of the sequences are
shifted, reserved, duplicated, or reversed during evolutions. Despite the apparently chaotic
randomness, those repeats undergo elementary mutations mostly during DNA replication in
successive cell divisions, and then become approximate repeats. This idea motivates us to
locate these segments. Our DNA compression algorithm consists of three phases. First, we
build a suffix tree to locate exact repeats. Then, we extend the exact repeats into approximate
repeats and the best possible combinations of non-overlapping repeats are found to compress
the DNA sequence. Finally, we use two sets of codes: two bit per base for exact and
approximate repeats, and arithmetic encoding for non-repeat regions. For encoding the numbers
in self-delimited way, the Fibonacci encoding method is used.
The time complexity of our algorithm is bounded by O(ISI +llkl\ where 151is the length of the
input sequence, I is the average length of repeats and Ikl is the maximum number of repeats
among all the repeat groups. We tested our proposed algorithm using some standard DNA test
Sequences and the result shows that our algorithm do better than the Cfact, GenCompress and
some other algorithm by several orders of magnitude. Moreover, the proposed algorithm can be
used to compress long sequences with millions of bases or more.