Alignment Methods MB&B 452a/752a2 12/9/99 Sang-Won Kim Introduction Since the genome sequencing projects for many organisms have been launched, enormous amount of sequence information produced make it impossible to analyze the data manually and computational methods became inevitable. Establishing a significant sequence relationship usually has several goals. Most molecular biologists want to know if a similarity is significant enough to predict the function of a newly discovered sequence by analogy to the sequences of known function. The other goal of sequence comparison is to understand how sequences have evolved. And a third would be to find the adjacent sequences or full length cDNA by using only part of it. Sequence Aligenment Dot matrix method(1) The very first method for molecular sequence analysis was dot matrix method. To compare two sequences, A and B, sequence A is written out vertically with each letter(residue or base) representing a row and sequence B is written out horizontally with each letter representing a column. Then, each letter of sequence A is compared to each letter of sequence B and a dot is placed at the intersection of each row and column above the letters are identical. One then visually inspects the matrix for meaningful pattern of dots that represent certain sequence features. Dynamic programming If the sequences are assumed to be approximately for the same length(N), then time proportional to N2 is required in dot matrix method. To account for the presence of gaps, the calculation would be 2N times to examine the possibility of gaps in each position of each sequence, for time proportional to N4N. Needleman and Wunsch introduced dymanic programming as a more efficient way of aligning sequences. The dynamic programming method requires only time proportional to N2. It uses a scoring system that takes into account the physical, chemical, structural and evolutionary similarity of amino acid residues. The best alignment that ends at a given pair of bases or residues is the best alignment of the sequences up to that point, plus the score for aligning the two additional bases or residues.(2) And gap penalty is calculated based on both of a length-dependent term(cost of extending the gap) and a length-independent term(cost of opening). Even though this algorithm was initially designed for the global alignment which find alignment in which total score is highest, perhaps at expense of areas of great local similarity, some modification can make it useful for the local alignment which find the highest scoring subsequences at the expense of the overall score.(3) FASTA and FASTP programs Because the dynamic programming algorithm requires order N2 operations, the use of this algorithm for database searching on currently available PC and workstations is hard. An alternative strategy which greatly accelerates the initial search speed while still functioning a standard processes involves the use of hashing. This is accomplished by building a table of the locations of short subsequences or words, in the query sequence. Segments from the database are then sequentially tested against this hash or neighborhood table. FASTA and FASTP programs use this type of algorithm. In the case of FASTA, the hash table is built to require exact matches between the query and target sequence over the full length of the hash entry. The size of the hash table entry(word size) is adjustable as an input parameter(ktuple). The use of larger ktuple will increase the speed but decrease the sensitivity.(4) BLAST(Basic Local Alignment Search Tool) BLAST examine only ungapped segments but ranks matches statistically.(5) 'Expect' probability values represent the calculated probability of finding similarities with this significance in the entire target database. While the BLAST approach is approximate in the sense that the possibility of gapped alignment is ignored, the statistical ranking and significance tests outweigh this disadvantage. BLAST first scans the database for words(hash) that score at least T(threshold) when aligned and then extend the aligned word pair in both directions until the aligned score doesn't increase. This extension step consumes most of the processing time. Gapped BLAST It requires the existence of two non-overlapping word pairs on the same diagonal, and within a distance A of one another before an ungapped extension is invoked. Because only a small fraction of the hits are extended, the computation required decrease. And the ability to generate gapped alignments has been added in this program and it made necessary to find only one rather than all the ungapped alignments in the significant result.(6) The resulting gapped alignment is reported only if it has an E-value low enough to be of interest. PSI-BLAST(Position-Specific Iterated BLAST) Motif or profile search methods(using position-specific score matrices) frequently are much more sensitive than pairwise comparison methods at detecting distant relationships. However, creating a set of motifs or a profile that describes a protein family, and searching a database with them, typically has involved running several different programs, with substantial user intervention at various stages They have automated the procedure of generating an arbitrary position-specific score matrix from the output produced by a BLAST search and adapted the BLAST algorithm to take this matrix as input. The resulting position-specific iterated BLAST may not be as sensitive as the best available motif search programs, but its speed and ease of operation can bring the power of these methods into common use.(6) Multiple Alignment In addition to displaying the relationships among a group of sequences, multiple alignment is widely used in phylgenetic analysis, protein modeling and structure prediction and the detection and quantitation of sequence patterns or motifs. Progressive alignment methods In this approach, each pair of sequences is aligned to give a measure of the similarity of the sequence pair. The sequences are then grouped into clusters based on similarity. The multiple alignment is produced by performing a pairwise alignment of the two most closely related sequences and then adding in additional sequences or clusters of sequences, one at a time according to the clustering pattern. This method is sensible biologically and has the great advantage of speed and simplicity even though it may not produce alignment with mathematical property.But the intrinsic problem is that any mistakes that appear during early alignments in a progressive multiple alignment cannot be corrected later as new sequence information is added.(Local minimum problem). Another problem is that it uses just one set of parameters(normally an amino acid substitution matrix and two gap penalties) and hopes that they will do for all(parameter choice problem)(7). Structure Alignment Structure alignment consists of establishing equivalences between the residues in two different proteins, as in sequence alignment. However, this equivalence is determined mainly on the basis of the three-dimensional coordinates corresponding to each residue, not on the basis of the amino acid identity. Alignment method is very much like classic Needleman-Wunsch sequence alignment. It consists of building a similarity matrix Sij based on the interatomic distances between each atom i in the first structure and each atom j in the second structure. Then dynamic programming is used to this matrix to find the optimal global alignment. d (distance between atom of i and j)= (Xi-Xj)2+(Yi-Yj)2+(Zi-Zj)2 M (i,j)=100/(5+d2) Whereas in the sequence alignment the optimum match for one part of a sequence is not affected by the match for any other part, the alignment in a structural alignment algorithm can depend on the initial equivalences. That is, in the structural alignment, the matrix depends on the relative 3D positioning of the two structures, which in turn, affected by how they have been previously aligned, so the procedure must be iterated until it converges.(8) References 1 .Chapter 3 from Gribskov, M. and Devereux, J. (1992). Sequence Analysis Primer. New York, Oxford University Press. 2. Needleman, S. B. and Wunsch, C. D. (1971). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J. Mol. Biol. 48:443-453. 3. Smith, T. F. and Waterman, M. S. (1981). "Identification of common molecular subsequences." J. Mol. Biol. 147: 195-197 4 .Pearson, W.,R., Lipman, D.,J. " Improved tools for biological sequence comparison. " Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8. 5. Altschul, S., Gish, W., Miller,W., Myers, E. W. and Lipman, D. J.(1990). "Basic local alignment search tool" J. Mol. Biol 215: 403-410 6. Alschul et al. (1998). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res 1997 Sep 1;25(17):3389-402 7 .Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). "Using CLUSTAL for multiple sequence alignments," Methods Enzymol 266, 383-402.b 8 .M Gerstein & M Levitt (1998). "Comprehensive Assessment of Automatic Structural Alignment against a Manual Standard, the Scop Classification of Proteins," Protein Science 7: 445-456.