1.1.1 compression. The sequence that was compressed

1.1.1     
GenCompress

In 1999 Xin Chen, Sam
Kwong and Ming Li 37-58 came up with a new genetic Compression algorithms
known as Gencompress. The Gencompress algorithm used a reference sequence for
compression. The sequence that was compressed was compared with the reference
sequence and approximate repeats and repeal compliments were searched. This
where then encoded with three factors which were as follows:

We Will Write a Custom Essay Specifically
For You For Only $13.90/page!


order now

·      
Length

·      
Position

·      
And finally the error factor

The Gencompress algorithm was giving better
compression ratio than the Biocompress and Cfact compression algorithms. There
were two varieties of this compression algorithm, the Gencompress1 and the
Gencompress2.

Gencompress1 uses the hamming distances only
for substitution whereas Gencompress2 exploited the edited distance. Unlike Cfact
Gencompress was a one pass algorithm

In Gencompress the edit operation worked
something like the following:

The Gencompress Algorithm considered the 3
standard edit operations for approximate matches and these operations were

·     
Replace: Expressed as
(R, p, char) this can be interpreted as, replacing the character at position p
by character char.

·     
Insert: Expressed as
(I, p, char), this can be interpreted as, inserting character char at p.

·     
Delete: Expressed as
(D, p), this can be interpreted as, deleting the character at position p.

Gencompress 37 was a better compression Algorithm than Biocompress
and Cfact in terms of the compression ratio but this Algorithm also did not
provide any facility of partial decompression and inter sequence similarity
search.

 

1.1.2     
GenomeCompress

The next in algorithm genome compress was
developed in 2005 by Umesh Ghoshdastider and Bannani Saha 59-71. Genome
compressed used a very simple algorithm for compressing the biological data and
was used for compressing both repetitive and non-repetitive sequences. Because
genome compress was non dynamic in nature hence it used very less execution
time and memory. The developer of GenomeCompress, claims that the algorithm is approximately
giving 22% compression i.e. it reduces the size of each base to 1.76 bpb. GenomeCompress is a single pass algorithm which starts by
dividing the complete sequence into segments of 4 base pairs. Genome compress
uses a set of 5 bit binary codes to replace each segment. Because the algorithm
is using 5 bit binary code hence there is a possibility of 32 different
combinations whereas the biological DNA data has only 4 base pairs. Hence if a
combination of 4 base pairs is created then 24 such segments will be required. Out
of eight five bit binary numbers, four are utilized to encode repetitive DNA
sequence and the remaining is used to encode each bases of segment.

An
example from the Base paper 59 is as follows :

Let
us consider the sequence GAAT TTGC AAAA AAAA GCTA ATGC CTAG GGTT TTTG CCCC CCCC
AAAA TCAG TTGC ATAG GACG . This sequence is of length 64 and 64 bytes are
required to store it in a text file. If it is compressed using zip program
included in Windows XP, the size becomes 163 bytes which is not intended at
all. This proves that it is very difficult of compress a DNA sequence and
general compression programs just increase its size while compressing. By GenomeCompress
every segment of length four is replaced by a 5 bit number and eight
repetitive A,T, G and C are replaced by four unique 5 bit numbers. Two AAAAAAAA
and CCCC CCCC are found in this sequence. So 5*12+5*2 bits or 70 bits or just 9
bytes are required for compressed sequence by GenomeCompress.

 

 

1.1.3     
DNACompress

DNACompress was
developed by Xin Chen1 , Ming Li, Bin Ma and John
Tromp in the year 2002 72- 78 DNA compress is a two-phase
algorithm. DNA compress uses of third party tools for finding different type of
repeat. This tool is known as Pattern Hunter 72 The two phases involved in
DNACompress are as follows:

·      
In the first phase the pattern hunter searches
for complementary palindromes and approximate receipts. Each of these repeat
are given a score. The steps are as follows:

o  
Run PatternHunter and
output all approximate repeats (and approximate reverse complements) into a
list A in the order of descending scores;

o  
Extract a repeat r with
highest score from list A and add r into another repeat list B;

o  
Process each repeat in
A so that there’s no overlap with the extracted repeat r;

o  
Goto step 2 if the
highest score of repeats in A is still higher than a pre-defined threshold;
otherwise exit.

·      
In the second phase the repeat with highest
score are encoded. The steps are as follows:

o  
One bit to show which
kind of repeat it is, forward repeat or complemented palindrome.

o  
A triple (l,i, j). It
is used to copy a previous substring of length l starting at i to the current
position j;

o  
The total number of
edit operations contained in this approximate repeat;

o  
All triples (e, o, b).
They are used to edit the copied substring. Instead of encoding each edit
operation separately, a consecutive region of the same edit operation (or say a
block edit operation) can alternatively employ a more efficient encoding
method.

DNACompress check each repeat and finds out
that weather after compression the size of the repeat will be less than 2 bits
per base. If this condition is true then it is compressed otherwise it is
discarded. At the end of the algorithm all the non repeat bases and discarded
bases are concatenated together and in encoded.

DNACompress claims to achieve compression of 1.72 bits per base.
But this algorithm also doesn’t provide any feature of partial decompression
and inter chromosomal similarity search.