Description

[ Home ] [ Description ] [ Instructions ] [ Source Code ] [ Samples ] [ Applet ]

Description

The SequenceShop 1.0b utility, written in JAVA (Sun JDK 1.2.1), implements dynamic programming methods [2] for aligning sequences. It accepts any ASCII characters as input and thus can be used for both DNA sequence and protein sequence alignments. SequenceShop provides options for automatic global alignment, automatic local alignment or custom alignment based on all parameters explicitly specified by the user. In addition, the program allows for both case-sensitive and all-uppercase alignments. This utility functions as a graphical applet and it can be accessed through the Internet from any platform provided that the browser supports the JAVA 1.2 language (Netscape Communicator version 4.5 or higher is required -- the program may or may not work with Internet Explorer 4 and/or 5).

SequenceShop implements a combination of the basic Needleman-Wunsch algorithm [2] and the Smith-Waterman approach [3] for local alignment. The heart of the program lies in the "Matrix" object which allocates memory for the two dimensional number array on which the "Matrix" functions perform alignment computations. The two sequences entered by the user define the rows and columns of the matrix. First, the program computes a "Similarity Matrix" where all characters from the first sequence are compared against each character in the other sequence. The match and mismatch values can be specified by the user. They are typically "1" and "0", respectively, for global alignment and "1" and "-0.6" for locally sensitive alignment. The negative scores for mismatches in the local alignment allow the scores to start decreasing once the algorithm passes a region of high local similarity. Next, the program computes the "Sum Matrix" starting with the bottom right cell. SequenceShop subtracts the gap penalties (based on user’s values for gap opening and gap extension parameters) as it computes the sum matrix. The maximum score for the matrix can end up anywhere in the matrix depending on what value for mismatch penalty is specified. For the sum matrix computation algorithm and the gap penalty formula see reference [1].

The traceback algorithm starts with the position of the maximum overall score and traces back through the highest scores via two traceback "branches": first to the bottom and then to the top of the maximum score. This programming construct makes it possible to use one function for calculating both global and local alignments. In the case of global alignment (mismatch penalty > 0) only the bottom "branch" needs to be computed because the maximum score is in the top row. As the program traces back through the matrix it fills three dynamic strings with aligned characters from the two sequences and creates a third "connecting" string in between the two. These three strings constitute the final output of the program.

SequenceShop 1.0b does not utilize caching methods for speed optimization and the computational time required is proportional to N² (if sequences are of approximately the same length N). Implementing a caching method in the future version of the program could increase the speed making it proportional to N. With the current algorithm, sequences of length N = 500 can be aligned in about 5 seconds on a Pentium 200MHz processor with 32Mb of RAM. Sequences of length < 300 are aligned almost instantaneously.

Instructions for installing SequenceShop:

1) A copy of the full project can be accessed at:

http://pantheon.yale.edu/~mjr33

Note: A graphical browser (Lynx will not work) must be used.

2) Use the UNIX command "tar" utility to extract files:

tar -xvf "name of tar file"

This will extract all of the html and other support files into the
directory structure ready for display on the Web. This also extracts the
compiled .class files into the appropriate directory.

3) All files and directories pertaining to the project must be
set for READ and EXECUTE access for outsiders.

2) If there is a need to recompile from source code type in:

javac SequenceShop.java

This will produce five .class files necessary for the applet to
run. This action should NOT be necessary because the tar file already
contains compiled .class files in the /class_files directory.

NOTE: This version of the program is written for Web use and cannot be
run under UNIX (Pantheon).

Netscape Communicator 4.5 or higher must be used to view the applet.
(or any other browser supporting JAVA 1.2)

index.html - main page
descript.html - description of the project
instruct.html - instructions on how to use the program
samples.html - sample runs of the program (as .jpg image files)
source.html - contains the source code of SequenceShop in JAVA
applet.html - page with the actual, running SequenceShop

SequenceShop.java - source code
readme.txt - installation instructions

/class_files - directory with compiled .class files to run the program
/images - directory with the sample run images and background image

References

1. Chapter 3 from Gribskov, M. and Devereux, J. (1992). Sequence Analysis Primer. New York, Oxford University Press.

2. Needleman, S. B. and Wunsch, C. D. (1971). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J. Mol. Biol. 48: 443-453.

3. Smith, T. F. and Waterman, M. S. (1981). "Identification of common molecular subsequences." J. Mol. Biol. 147: 195-197.