Alignment Methods                MB&B 452a/752a2
                                                   12/9/99   Sang-Won Kim
Introduction
Since the  genome sequencing   projects  for many  organisms  have  been  launched, 
enormous amount of sequence information produced  make  it impossible to analyze the   
data manually and computational methods became inevitable.
Establishing a  significant  sequence  relationship  usually  has  several   goals. Most  
molecular biologists want to know  if a similarity  is significant enough to  predict  the 
function  of  a newly  discovered  sequence by  analogy to  the  sequences of  known  
function. The other goal of sequence comparison is to understand   how sequences have 
evolved. And  a third would be to find the  adjacent sequences or full length cDNA by 
using only part of it.

Sequence Aligenment
Dot matrix method(1)
The very first  method for  molecular sequence  analysis was  dot matrix  method. To 
compare two sequences, A  and  B, sequence A   is written out  vertically  with each  
letter(residue or  base)  representing  a  row   and sequence   B   is written   out 
horizontally  with each  letter representing a  column.  Then,  each letter  of sequence  
A is  compared to  each letter  of sequence B and  a dot is  placed at  the intersection 
of  each row  and column above  the letters  are identical. One  then visually inspects 
the  matrix for meaningful pattern  of dots that represent certain sequence features. 

Dynamic programming
If the  sequences are  assumed to   be approximately for   the same  length(N),  then 
time proportional to N2 is required in  dot matrix method. To account  for the presence 
of gaps, the calculation would be  2N times to examine  the possibility of gaps  in each 
position  of each sequence, for time proportional to N4N. 
Needleman and  Wunsch introduced   dymanic programming as   a more efficient  way  
of aligning sequences. The dynamic programming method requires only time proportional 
to N2. It uses  a  scoring system  that  takes  into account  the  physical,  chemical, 
structural  and evolutionary similarity of  amino acid residues. The best  alignment that 
ends at a given pair of bases or residues  is the best alignment of the sequences up to 
that point, plus the score for aligning  the two additional bases or residues.(2) And  gap 
penalty is calculated  based on both  of  a length-dependent   term(cost  of  extending  
the gap)   and a length-independent term(cost of opening). Even though  this algorithm  
was initially  designed  for   the global  alignment  which  find alignment in   which 
total   score is highest,    perhaps  at expense  of  areas   of great local  similarity, 
some modification  can  make   it useful   for the  local  alignment which  find  the 
highest scoring subsequences at the expense of the overall score.(3)

FASTA and FASTP programs
Because the dynamic programming algorithm  requires order N2 operations,  the  use of 
this algorithm for database searching  on  currently available  PC and workstations  is 
hard. An alternative strategy  which greatly accelerates the  initial search speed  while 
still functioning a standard processes involves the use of  hashing. This is accomplished  
by building  a table  of the  locations of  short subsequences  or words,  in the  query 
sequence. Segments from the database are then sequentially tested against this  hash or 
neighborhood table. FASTA and  FASTP programs use  this type of  algorithm. In the 
case of FASTA, the hash table is built to require exact matches between the query and 
target sequence over the full length of the hash entry.   The size   of the   hash table  
entry(word size)   is adjustable  as   an input parameter(ktuple). The   use  of larger  
ktuple  will increase   the speed but   decrease the sensitivity.(4)

BLAST(Basic Local Alignment Search Tool)
BLAST examine   only  ungapped  segments  but   ranks  matches  statistically.(5) 
'Expect' probability   values   represent  the   calculated probability     of finding   
similarities with   this  significance in  the entire target  database. While  the BLAST 
approach is approximate in  the sense that   the  possibility of  gapped   alignment is   
ignored, the statistical   ranking and significance  tests  outweigh   this  disadvantage.  
BLAST  first  scans  the  database  for words(hash) that score  at least  T(threshold) 
when  aligned and  then  extend  the aligned word  pair in both  directions  until the 
aligned score doesn't  increase. This extension  step consumes  most of the  processing 
time.

Gapped BLAST
It requires the  existence of  two non-overlapping  word  pairs on  the same diagonal,  
and within a  distance A  of  one  another before  an ungapped extension  is invoked. 
Because only a  small fraction  of the  hits are   extended, the  computation  required 
decrease.  And the ability to  generate  gapped alignments   has been  added in   this 
program  and it  made necessary  to  find only  one rather   than all   the ungapped 
alignments  in the  significant result.(6) The resulting gapped alignment is reported only 
if it  has an E-value low enough to be of interest.

PSI-BLAST(Position-Specific Iterated BLAST)
Motif or profile search  methods(using position-specific score   matrices) frequently are 
much more  sensitive  than   pairwise  comparison  methods   at  detecting  distant  
relationships. However, creating  a set  of motifs or  a profile  that describes a  protein 
family, and searching a database  with them,  typically   has involved  running several  
different  programs, with substantial user  intervention at  various  stages They  have 
automated   the procedure of generating  an  arbitrary  position-specific score   matrix  
from the   output produced by   a BLAST search  and adapted  the  BLAST algorithm  
to take   this matrix as  input.   The resulting position-specific iterated  BLAST may 
not  be as sensitive  as the best  available  motif search programs, but its  speed and 
ease of  operation can bring the  power of these methods into common use.(6)

Multiple Alignment
In addition  to displaying  the relationships   among a  group of  sequences,  multiple 
alignment is  widely used  in  phylgenetic  analysis, protein   modeling and   structure  
prediction and the detection and quantitation of sequence patterns or motifs.

Progressive alignment methods
In this approach, each pair of  sequences is aligned to give  a measure of the similarity  
of the  sequence  pair. The  sequences   are then   grouped into  clusters   based on 
similarity.  The multiple alignment is  produced by performing a  pairwise alignment of 
the two most   closely related  sequences and then  adding in  additional sequences or 
clusters of sequences, one at  a time according to the   clustering pattern. This method  
is sensible biologically  and  has  the great advantage  of speed    and simplicity even  
though it may   not produce  alignment with mathematical  property.But  the  intrinsic 
problem is that any mistakes that   appear during  early  alignments  in  a progressive 
multiple alignment  cannot   be corrected   later as   new sequence   information  is 
added.(Local  minimum  problem).  Another  problem  is  that  it  uses  just  one  set  
of parameters(normally an amino acid  substitution matrix and  two gap penalties)  and 
hopes that they will do for all(parameter choice problem)(7).


Structure Alignment
Structure alignment  consists of   establishing equivalences  between   the residues  in  
two  different  proteins,  as  in   sequence   alignment. However,   this  equivalence   
is determined mainly on  the basis of  the three-dimensional coordinates  corresponding 
to each residue,  not on the basis of the amino acid identity.
Alignment method is  very much  like classic  Needleman-Wunsch sequence  alignment. 
It consists of building a similarity matrix Sij based on the interatomic distances between 
each atom i  in the  first structure  and each  atom j  in the  second structure.  Then 
dynamic programming is used to this matrix to find the optimal global alignment.

d (distance between atom of i and j)=  (Xi-Xj)2+(Yi-Yj)2+(Zi-Zj)2
M (i,j)=100/(5+d2)

Whereas in the sequence alignment the  optimum match for one part  of  a sequence is 
not affected by the match for  any other part, the alignment  in a  structural alignment 
algorithm can  depend  on  the  initial   equivalences. That   is,  in  the  structural   
alignment, the  matrix depends on the  relative  3D positioning of  the  two structures,  
which in turn,   affected by  how they   have  been  previously   aligned, so   the 
procedure  must  be iterated   until it converges.(8)

References
1 .Chapter 3 from  Gribskov, M. and  Devereux,  J. (1992). Sequence Analysis  Primer. 
New York, Oxford University Press.

2. Needleman, S. B.  and Wunsch, C.  D. (1971).  "A general  method applicable to  the 
search for  similarities  in the  amino  acid sequence  of  two proteins."  J.  Mol. Biol. 
48:443-453. 

3. Smith,   T.  F.   and Waterman,    M. S.  (1981).   "Identification  of  common   
molecular subsequences." J. Mol. Biol. 147: 195-197 

4 .Pearson, W.,R., Lipman, D.,J.  " Improved tools for  biological sequence comparison. "  
 
Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8. 

5. Altschul, S.,  Gish, W.,  Miller,W., Myers, E.   W. and  Lipman, D. J.(1990).  "Basic 
local alignment search tool" J. Mol. Biol 215: 403-410

6. Alschul et  al.  (1998). "Gapped  BLAST  and  PSI-BLAST: a  new  generation of 
protein database search programs," Nucleic Acids Res 1997 Sep 1;25(17):3389-402 

7 .Higgins, D.  G., Thompson, J.   D. & Gibson,  T. J. (1996).   "Using CLUSTAL for 
multiple sequence alignments," Methods Enzymol 266, 383-402.b

8 .M  Gerstein  &  M  Levitt   (1998).  "Comprehensive Assessment   of Automatic   
Structural Alignment against a Manual  Standard, the  Scop Classification  of Proteins," 
Protein Science 7: 445-456.