Advanced DNA sequence compression algorithm

BUET ILS
BUET Institutional Repository: Home
→
Dissertations/Theses
→
Dissertations/Theses - Department of Computer Science and Engineering
→
View Item

Advanced DNA sequence compression algorithm

Sadia Sharmin

URI: http://lib.buet.ac.bd:8080/xmlui/handle/123456789/1545

Date: 2009-12

Abstract:

In this thesis we propose an advanced DNA sequence compression algorithm, which is efficient and lossless. None of the available general-purpose compression algorithm compresses these sequences. This is due to the specificity of the genetic sequences. It is believed that DNA sequence encodes life in a nonrandom way. Some important sections of the sequences are shifted, reserved, duplicated, or reversed during evolutions. Despite the apparently chaotic randomness, those repeats undergo elementary mutations mostly during DNA replication in successive cell divisions, and then become approximate repeats. This idea motivates us to locate these segments. Our DNA compression algorithm consists of three phases. First, we build a suffix tree to locate exact repeats. Then, we extend the exact repeats into approximate repeats and the best possible combinations of non-overlapping repeats are found to compress the DNA sequence. Finally, we use two sets of codes: two bit per base for exact and approximate repeats, and arithmetic encoding for non-repeat regions. For encoding the numbers in self-delimited way, the Fibonacci encoding method is used. The time complexity of our algorithm is bounded by O(ISI +llkl\ where 151is the length of the input sequence, I is the average length of repeats and Ikl is the maximum number of repeats among all the repeat groups. We tested our proposed algorithm using some standard DNA test Sequences and the result shows that our algorithm do better than the Cfact, GenCompress and some other algorithm by several orders of magnitude. Moreover, the proposed algorithm can be used to compress long sequences with millions of bases or more.

Show full item record

Files in this item

Name: Full Thesis .pdf

Size: 681.5Kb

Format: PDF

View/Open

This item appears in the following Collection(s)

Dissertations/Theses - Department of Computer Science and Engineering
Post graduate dissertations (Theses) of Computer Science Engineering (CSE)

Advanced DNA sequence compression algorithm

Advanced DNA sequence compression algorithm

Abstract:

Files in this item

This item appears in the following Collection(s)

Search BUET IR

Browse

All of IR

This Collection

My Account