DSpace Repository

Advanced DNA sequence compression algorithm

Show simple item record

dc.contributor.advisor Kaykobad, Dr. Mohammad
dc.contributor.author Sadia Sharmin
dc.date.accessioned 2015-12-19T09:40:27Z
dc.date.available 2015-12-19T09:40:27Z
dc.date.issued 2009-12
dc.identifier.uri http://lib.buet.ac.bd:8080/xmlui/handle/123456789/1545
dc.description.abstract In this thesis we propose an advanced DNA sequence compression algorithm, which is efficient and lossless. None of the available general-purpose compression algorithm compresses these sequences. This is due to the specificity of the genetic sequences. It is believed that DNA sequence encodes life in a nonrandom way. Some important sections of the sequences are shifted, reserved, duplicated, or reversed during evolutions. Despite the apparently chaotic randomness, those repeats undergo elementary mutations mostly during DNA replication in successive cell divisions, and then become approximate repeats. This idea motivates us to locate these segments. Our DNA compression algorithm consists of three phases. First, we build a suffix tree to locate exact repeats. Then, we extend the exact repeats into approximate repeats and the best possible combinations of non-overlapping repeats are found to compress the DNA sequence. Finally, we use two sets of codes: two bit per base for exact and approximate repeats, and arithmetic encoding for non-repeat regions. For encoding the numbers in self-delimited way, the Fibonacci encoding method is used. The time complexity of our algorithm is bounded by O(ISI +llkl\ where 151is the length of the input sequence, I is the average length of repeats and Ikl is the maximum number of repeats among all the repeat groups. We tested our proposed algorithm using some standard DNA test Sequences and the result shows that our algorithm do better than the Cfact, GenCompress and some other algorithm by several orders of magnitude. Moreover, the proposed algorithm can be used to compress long sequences with millions of bases or more. en_US
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering, BUET en_US
dc.subject Algorithms - DNA sequence en_US
dc.title Advanced DNA sequence compression algorithm en_US
dc.type Thesis-MSc en_US
dc.contributor.id 100605052 P en_US
dc.identifier.accessionNumber 107472
dc.contributor.callno 006.31/SAD/2009 en_US


Files in this item

This item appears in the following Collection(s)

Show simple item record

Search BUET IR


Advanced Search

Browse

My Account