Presentation for Bioinformatics:
ALIGNMENT METHODS
FOR DUMMIES
(or undergraduates with little science background)
PART I:
SEQUENCE ALIGNMENT
BY ED PILLSBURY
PART II:
STRUCTURAL ALIGNMENT
BY RANJIT BINDRA
WHY IS THIS IMPORTANT?
SEQUENCE HOMOLOGY:
-nucleotide comparison
-protein sequence comparison
CHARACTERIZATION OF PROTEINS
DETERMINATION OF FUNCTION
EVOLUTIONARY ORIGIN
AND MORE!!!!
DOT MATRIX METHOD:
-first proposed by Gibbs and McIntyre (1970) for sequence alignment
-Advantages include identification of intrasequence rearrangement
-set up 2 x 2 matrix of protein sequences, e.g.
dotmatrixmethod vs. matrixdotmethod
D |
O |
T |
M |
A |
T |
R |
I |
X |
M |
E |
T |
H |
O |
D |
|
M |
X |
X |
|||||||||||||
A |
X |
||||||||||||||
T |
X |
X |
X |
||||||||||||
R |
X |
||||||||||||||
I |
X |
||||||||||||||
X |
X |
||||||||||||||
D |
X |
X |
|||||||||||||
O |
X |
X |
|||||||||||||
T |
X |
X |
X |
||||||||||||
M |
X |
X |
|||||||||||||
E |
X |
||||||||||||||
T |
X |
X |
|||||||||||||
H |
X |
||||||||||||||
O |
X |
X |
|||||||||||||
D |
X |
X |
-diagonals indicate homology
-gaps between diagonalsà
SHOW SEQUENCE DIPLACEMENT
-other similarities are considered noiseà
noise is filtered at stringency levels determined by experimenter
PROS AND CONS:
PROSà
-deals effectively with intrasequence rearrangements (a drawback of dynamic programming)
-for this reason should be used first for any sequence analysis (Collins and Coulson 1987)
CONSà
-difficult to align sequences optimally in a consistent quantifiable manner
-relies on experimenter to introduce gaps to achieve better alignment
DYNAMIC PROGRAMMING
-NEEDLMAN-WUNSCH (1970) provided first theoretical framework for global alignment (i.e. comparison of the entire sequence)
BASIC PREMISES OF DP:
-align two sequences allowing for all possible gaps (i.e. deletions in the protein or nucleotide sequence)
-optimal alignment is determined by evaluating each unit (amino acid or nucleotide) in a 2 x 2 matrix
DYNAMIC PROGRAMMING
"the best alignment that ends at a given pair of bases or residues is the best alignment of the sequences up to that point, plus the score for aligning the two additional bases" -States and Boguski
example:
when comparing two sequences A + B
A à AATGC
B à AG--GC
The optimal match is determined by the score of the precursor match plus the score of the two additional units:
AATG C NEW
+ OPTIMAL
AG--G C MATCH
DYNAMIC PROGRAMMING
-brute force calculation of sequence comparison with insertion of GAPS would be:
N2
operations done 2N times =N4N
-meaning that a comparison of two sequences with 300 residues would take
1088 operations
(IMPOSSIBLE!!!!)
- DP by utilizing optimal alignment requires only
N2 operations, good for computational efficiency
DYNAMIC PROGRAMMING
Example of sequence comparison:
à proteins
TEXASRANGERSARETHEGREATESTBASEBALLTEAM
VS.
REDSKINSARETHEGREATESTFOOTBALLTEAM
Step 1) Set up 2 x 2 Identity Matrix and assign values of 1 to matches
T |
E |
X |
A |
S |
R |
A |
N |
G |
E |
R |
S |
|
R |
1 |
1 |
||||||||||
E |
1 |
1 |
||||||||||
D |
||||||||||||
S |
1 |
1 |
||||||||||
K |
||||||||||||
I |
||||||||||||
N |
1 |
|||||||||||
S |
1 |
1 |
||||||||||
A |
1 |
1 |
||||||||||
R |
1 |
1 |
||||||||||
E |
1 |
1 |
DYNAMIC PROGRAMMING
Step 2) Compute the Sum Matrix for the alignment with the algorithm:
New_value_cell <= cell (R,C) +
Max { Cell (R+1, C+1),
Cells (R+1, C+2 to C_max),
Cells (R+2 to R_max, C+1) }
WHAT DOES THIS MEAN??
1ST ) TAKE VALUE IN IDENTITY MATRIX (0 OR 1)
2ND ) ADD THIS VALUE TO THE MAXIMUM VALUE LOCATED IN THE ROW OR COLUMN DIAGONAL DOWN FROM THE CELL IN THE SUM MATRIX
DYNAMIC PROGRAMMING
Start Chugging Away!!!!
T |
E |
X |
A |
S |
R |
A |
N |
G |
E |
R |
S |
|
R |
1 |
1 |
||||||||||
E |
1 |
1 |
||||||||||
D |
||||||||||||
S |
1 |
1 |
||||||||||
K |
||||||||||||
I |
||||||||||||
N |
2 |
1 |
1 |
1 |
0 |
|||||||
S |
1 |
1 |
1 |
1 |
1 |
1 |
||||||
A |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
|||||
R |
1 |
1 |
1 |
0 |
1 |
0 |
||||||
E |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
DYNAMIC PROGRAMMING
Step 3: FINISH MATRIX
T |
E |
X |
A |
S |
R |
A |
N |
G |
E |
R |
S |
|
R |
4 |
3 |
3 |
3 |
2 |
3 |
2 |
2 |
2 |
1 |
2 |
0 |
E |
3 |
4 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
2 |
1 |
0 |
D |
3 |
3 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
1 |
1 |
0 |
S |
3 |
3 |
3 |
3 |
3 |
2 |
2 |
1 |
1 |
1 |
1 |
1 |
K |
3 |
3 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
1 |
1 |
0 |
I |
3 |
3 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
1 |
1 |
0 |
N |
3 |
3 |
3 |
3 |
2 |
2 |
1 |
2 |
1 |
1 |
1 |
0 |
S |
3 |
3 |
3 |
2 |
3 |
2 |
1 |
1 |
1 |
1 |
0 |
1 |
A |
2 |
2 |
2 |
3 |
2 |
1 |
2 |
1 |
1 |
1 |
0 |
0 |
R |
1 |
1 |
1 |
1 |
1 |
2 |
1 |
1 |
1 |
0 |
1 |
0 |
E |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
à Number in top left corner is Alignment score
DYNAMIC PROGRAMMING
Step 4) Traceback the optimum match
T |
E |
X |
A |
S |
R |
A |
N |
G |
E |
R |
S |
|
R |
4 |
3 |
3 |
3 |
2 |
3 |
2 |
2 |
2 |
1 |
2 |
0 |
E |
3 |
4 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
2 |
1 |
0 |
D |
3 |
3 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
1 |
1 |
0 |
S |
3 |
3 |
3 |
3 |
3 |
2 |
2 |
1 |
1 |
1 |
1 |
1 |
K |
3 |
3 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
1 |
1 |
0 |
I |
3 |
3 |
3 |
3 |
2 |
2 |
2 |
1 |
1 |
1 |
1 |
0 |
N |
3 |
3 |
3 |
3 |
2 |
2 |
1 |
2 |
1 |
1 |
1 |
0 |
S |
3 |
3 |
3 |
2 |
3 |
2 |
1 |
1 |
1 |
1 |
0 |
1 |
A |
2 |
2 |
2 |
3 |
2 |
1 |
2 |
1 |
1 |
1 |
0 |
0 |
R |
1 |
1 |
1 |
1 |
1 |
2 |
1 |
1 |
1 |
0 |
1 |
0 |
E |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
Alignment scheme from traceback:
T E X A S R A N G E R S
R E D S K I -- N S A R E
-Alternate tracebacks are possible
(but not with this example, see worksheet)
DYNAMIC PROGRAMMING
Variations of previous model:
-Analysis can become more complex with the utilization of different substitution matrices instead of identity matrix e.g.
Protein Similarity Matrix, value in matrix determined by amino acid characteristics like hydrophobicity, charge, size, etc. . .
-Gap penalties are introduced to mitigate the negative alignment effects of breaking a sequence
BOTH FACTORS AFFECT THE FINAL ALIGNMENT SCORE
LOCAL VS. GLOBAL ALIGNMENT
-local alignment seeks best alignment regardless of length
Different results for these techniquesà
Local- ignores overall sequence similarity highlighting common substructure/subfunction
Global- sacrifices local scores to ensure the best overall sequence alignment (in DP example, algorithm is required to end in the top left box)
HASHING
-sometimes N2 operations is still too slow for large sequence comparisons
-HASHING is method used to increase speed of sequence alignment (used in conventional programs e.g. FASTA, BLAST, BLAZE)
basically involves breakdown of original sequence to facilitate comparison between the two