Dynamic
Programming Algorithms
Dynamic programming can be employed as a means to
align two sequences. The premise behind
the algorithm is that an optimal alignment can be constructed by adding two
aligned characters to a previously obtained optimal alignment. This insight means that it is not necessary
to search all possible alignments in order to obtain the optimal one (given two
N-lengh sequences, this amounts to time proportional to N4N);
rather, you must sequentially add characters to an optimal alignment that you
have already constructed.
Optimal
alignments are made using a sum matrix where any given element is given a score
based on whether or not it represents a match, plus the score of the optimal
alignment to which it could be added, minus a gap penalty incurred for adding
spaces in the sequence. At the end, a
traceback is performed where you start with the highest scoring element in the
final column or row, and follow a path back to the end of the sequence such
that every element contributing to the highest score is aligned. For two N-length sequences, this amounts to
searching an average of N values for each of the N2 spots in the
matrix. This implies that the time will
be proportional to N3 – a significant improvement over searching all
possible alignments. However, given any
cell, you have already calculated the maximum score for two cells adjacent to
it, so only two comparisons need to be made for each matrix element, reducing
the necessary time to O(N2).
This project includes to
programs: one which uses an algorithm that should work in N3
(“align”) time and one that should work in N2 (“alignfast”) time by
caching previous values.
Consider two N length strings (for simplicity), A and
B. Both implementations create an N by
N matrix of structures M, where M(i,j) is an alignment between the ith element
of A and the jth element of B.
Each structure contains the score, its i and j coordinates (for easy
reference during traceback), and a pointer to the highest scoring element in
M(i+1, j+1) to M(i+1, N) or M(i+1, j+1) to M(N, j+1). In the event of a tie, the element that introduces fewer gaps is
chosen. If there are two equivalent
paths, the “high road” (M(i+1, j+1) to M(i+1, N) ) is taken.
After the sum matrix is constructed, the maximum value
in the top row or column closest to the origin (again, picking the top row in a
tie) is chosen as the starting point.
To complete the traceback, the pointer in the current high scoring structure
is taken as the next alignment.
The only difference in the two
implementations is that align searches each element along M(i+1, j+1) to M(i+1,
N) or M(i+1, j+1) to M(N, j+1) for every matrix element; alignfast stores the
previous high row and column scores, and then compares the stored scores with
M(i+1,j+1) to determine the final score of the element. The high scores are stored in another matrix
of structures of length N-1 by N-1 (since you don’t need to compare anything
for edges) as the sum matrix is being computed.
Included in the tar file are six c files. Utility.h and Utilityfast.h are header files
for align and align fast, respectively, which contain structure definitions and
function prototypes. Utility.c and
utilityfast.c contain definitions for the functions. Align.c and alignfast.c contain the main source code. The tar also contains a Makefile so that
“make”, invoked with no arguments produces the executables “align” and “alignfast”.
Both programs work identically
with the exception of time. Two strings
are taken from stdin, with a single carriage return separating the first from
the second, and either a carriage return or “end-of-file” sign (which can be
simulated by typing Ctrl-D at stdin) after the second. Previously stored strings in a text file can
be supplied to the programs by an input redirect with the following
syntax: align < your-text-file. (alignfast could also be
used)
Output will consist of the two aligned strings (one
above the other), with “|” linking aligned characters that are identical, and
“*” linking aligned characters that don’t match. Spaces in the sequences are represents by “-“. After the aligned strings, the total score
(representing the total number of matches with no gap penalties) is printed.
The programs can be timed using: time align <
your-text-file. (type man time for details)
In order to determine the length dependence of the
programs, we can look at the ratio of the time it takes to run a program with
strings of length N to the time for the program with strings of length 2N. If the program runs in O(N3),
this ratio should be 23 = 8, since proportionality constants cancel
in the ratio. For a program that takes
O(N2) time, the ratio should be 22 = 4.
I created several test files (containing and equal
number of “a”s in each string) to examine the different algorithms. The results were as follows:
The ratio for the align implementation is close to 8
as expected. Any discrepancy is likely
due to quadratic and linear factors in the time function. In the limit as N approaches infinity, we
expect these other factors to disappear (since N3+N2+N is
approximately equal to N3 for large N); as N increases in the table
above the factor does indeed get closer to 8.
The ratio for the alignfast
implementation is close to 4, also as expected. Again, as N increases the ratio seems to get closer to 4; the
anomalous ratio for the two smallest strings is likely a problem of precision
since the time function only reports to hundredths of seconds.
Finally, since these programs are
being run on a remote linux server, the percentage of CPU time being given to
your particular terminal will alter the time they take to execute. For all of the above measurements, the time
function returned a cpu percentage of about 100% +- .5%, so this consideration
is probably unimportant.
2
sample runs:
cobra:~/dynamic% make
gcc -c align.c
gcc -c utility.c
gcc -o align align.o utility.o
gcc -c alignfast.c
gcc -c utilityfast.c
gcc -o alignfast alignfast.o utilityfast.o
cobra:~/dynamic% align
abcnyrqclcrpm
aycynrckcrbp
abcny-rqclcr-pm
|*| | | |*|| |
ayc-ynr-ckcrbp-
Score: 8
cobra:~/dynamic% time alignfast < test
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Score: 158
0.010u 0.010s 0:00.03 66.6% 0+0k 0+0io 87pf+0w
cobra:~/dynamic%