|
Description
SequenceShop implements a combination of the basic Needleman-Wunsch algorithm [2] and the Smith-Waterman approach [3] for local alignment. The heart of the program lies in the "Matrix" object which allocates memory for the two dimensional number array on which the "Matrix" functions perform alignment computations. The two sequences entered by the user define the rows and columns of the matrix. First, the program computes a "Similarity Matrix" where all characters from the first sequence are compared against each character in the other sequence. The match and mismatch values can be specified by the user. They are typically "1" and "0", respectively, for global alignment and "1" and "-0.6" for locally sensitive alignment. The negative scores for mismatches in the local alignment allow the scores to start decreasing once the algorithm passes a region of high local similarity. Next, the program computes the "Sum Matrix" starting with the bottom right cell. SequenceShop subtracts the gap penalties (based on users values for gap opening and gap extension parameters) as it computes the sum matrix. The maximum score for the matrix can end up anywhere in the matrix depending on what value for mismatch penalty is specified. For the sum matrix computation algorithm and the gap penalty formula see reference [1]. The traceback algorithm starts with the position of the maximum overall score and traces back through the highest scores via two traceback "branches": first to the bottom and then to the top of the maximum score. This programming construct makes it possible to use one function for calculating both global and local alignments. In the case of global alignment (mismatch penalty > 0) only the bottom "branch" needs to be computed because the maximum score is in the top row. As the program traces back through the matrix it fills three dynamic strings with aligned characters from the two sequences and creates a third "connecting" string in between the two. These three strings constitute the final output of the program. SequenceShop 1.0b does not utilize caching methods for speed optimization and the computational time required is proportional to N2 (if sequences are of approximately the same length N). Implementing a caching method in the future version of the program could increase the speed making it proportional to N. With the current algorithm, sequences of length N = 500 can be aligned in about 5 seconds on a Pentium 200MHz processor with 32Mb of RAM. Sequences of length < 300 are aligned almost instantaneously.
Instructions for installing SequenceShop:
References 1. Chapter 3 from Gribskov, M. and Devereux, J. (1992). Sequence Analysis Primer. New York, Oxford University Press. 2. Needleman, S. B. and Wunsch, C. D. (1971). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J. Mol. Biol. 48: 443-453. 3. Smith, T. F. and Waterman, M. S. (1981). "Identification of common molecular subsequences." J. Mol. Biol. 147: 195-197. |