Dynamic Programming Algorithms

Overview

Dynamic programming can be employed as a means to align two sequences. The premise behind the algorithm is that an optimal alignment can be constructed by adding two aligned characters to a previously obtained optimal alignment. This insight means that it is not necessary to search all possible alignments in order to obtain the optimal one (given two N-lengh sequences, this amounts to time proportional to N^4N); rather, you must sequentially add characters to an optimal alignment that you have already constructed.

Optimal alignments are made using a sum matrix where any given element is given a score based on whether or not it represents a match, plus the score of the optimal alignment to which it could be added, minus a gap penalty incurred for adding spaces in the sequence. At the end, a traceback is performed where you start with the highest scoring element in the final column or row, and follow a path back to the end of the sequence such that every element contributing to the highest score is aligned. For two N-length sequences, this amounts to searching an average of N values for each of the N² spots in the matrix. This implies that the time will be proportional to N³ – a significant improvement over searching all possible alignments. However, given any cell, you have already calculated the maximum score for two cells adjacent to it, so only two comparisons need to be made for each matrix element, reducing the necessary time to O(N²).

Implementation

This project includes to programs: one which uses an algorithm that should work in N³ (“align”) time and one that should work in N² (“alignfast”) time by caching previous values.

Consider two N length strings (for simplicity), A and B. Both implementations create an N by N matrix of structures M, where M(i,j) is an alignment between the i^thelement of A and the j^thelement of B. Each structure contains the score, its i and j coordinates (for easy reference during traceback), and a pointer to the highest scoring element in M(i+1, j+1) to M(i+1, N) or M(i+1, j+1) to M(N, j+1). In the event of a tie, the element that introduces fewer gaps is chosen. If there are two equivalent paths, the “high road” (M(i+1, j+1) to M(i+1, N) ) is taken.

After the sum matrix is constructed, the maximum value in the top row or column closest to the origin (again, picking the top row in a tie) is chosen as the starting point. To complete the traceback, the pointer in the current high scoring structure is taken as the next alignment.

The only difference in the two implementations is that align searches each element along M(i+1, j+1) to M(i+1, N) or M(i+1, j+1) to M(N, j+1) for every matrix element; alignfast stores the previous high row and column scores, and then compares the stored scores with M(i+1,j+1) to determine the final score of the element. The high scores are stored in another matrix of structures of length N-1 by N-1 (since you don’t need to compare anything for edges) as the sum matrix is being computed.

Documentation

Included in the tar file are six c files. Utility.h and Utilityfast.h are header files for align and align fast, respectively, which contain structure definitions and function prototypes. Utility.c and utilityfast.c contain definitions for the functions. Align.c and alignfast.c contain the main source code. The tar also contains a Makefile so that “make”, invoked with no arguments produces the executables “align” and “alignfast”.

Both programs work identically with the exception of time. Two strings are taken from stdin, with a single carriage return separating the first from the second, and either a carriage return or “end-of-file” sign (which can be simulated by typing Ctrl-D at stdin) after the second. Previously stored strings in a text file can be supplied to the programs by an input redirect with the following syntax: align < your-text-file. (alignfast could also be used)

Output will consist of the two aligned strings (one above the other), with “|” linking aligned characters that are identical, and “*” linking aligned characters that don’t match. Spaces in the sequences are represents by “-“. After the aligned strings, the total score (representing the total number of matches with no gap penalties) is printed.

The programs can be timed using: time align < your-text-file. (type man time for details)

Outcome

In order to determine the length dependence of the programs, we can look at the ratio of the time it takes to run a program with strings of length N to the time for the program with strings of length 2N. If the program runs in O(N³), this ratio should be 2³= 8, since proportionality constants cancel in the ratio. For a program that takes O(N²) time, the ratio should be 2² = 4.

I created several test files (containing and equal number of “a”s in each string) to examine the different algorithms. The results were as follows:

The ratio for the align implementation is close to 8 as expected. Any discrepancy is likely due to quadratic and linear factors in the time function. In the limit as N approaches infinity, we expect these other factors to disappear (since N³+N²+N is approximately equal to N³ for large N); as N increases in the table above the factor does indeed get closer to 8.

The ratio for the alignfast implementation is close to 4, also as expected. Again, as N increases the ratio seems to get closer to 4; the anomalous ratio for the two smallest strings is likely a problem of precision since the time function only reports to hundredths of seconds.

Finally, since these programs are being run on a remote linux server, the percentage of CPU time being given to your particular terminal will alter the time they take to execute. For all of the above measurements, the time function returned a cpu percentage of about 100% +- .5%, so this consideration is probably unimportant.

2 sample runs:

cobra:~/dynamic% make

gcc -c align.c

gcc -c utility.c

gcc -o align align.o utility.o

gcc -c alignfast.c

gcc -c utilityfast.c

gcc -o alignfast alignfast.o utilityfast.o

cobra:~/dynamic% align

abcnyrqclcrpm

aycynrckcrbp

abcny-rqclcr-pm

|*| | | |*|| |

ayc-ynr-ckcrbp-

Score: 8

cobra:~/dynamic% time alignfast < test

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Score: 158

0.010u 0.010s 0:00.03 66.6% 0+0k 0+0io 87pf+0w

cobra:~/dynamic%