Presentation for Bioinformatics:

Presentation for Bioinformatics:

ALIGNMENT METHODS

FOR DUMMIES

(or undergraduates with little science background)

PART I:

SEQUENCE ALIGNMENT

BY ED PILLSBURY

PART II:

STRUCTURAL ALIGNMENT

BY RANJIT BINDRA

WHY IS THIS IMPORTANT?

SEQUENCE HOMOLOGY:

-nucleotide comparison

-protein sequence comparison

CHARACTERIZATION OF PROTEINS

DETERMINATION OF FUNCTION

EVOLUTIONARY ORIGIN

AND MORE!!!!

DOT MATRIX METHOD:

-first proposed by Gibbs and McIntyre (1970) for sequence alignment

-Advantages include identification of intrasequence rearrangement

-set up 2 x 2 matrix of protein sequences, e.g.

dotmatrixmethod vs. matrixdotmethod

-diagonals indicate homology

-gaps between diagonalsà

SHOW SEQUENCE DIPLACEMENT

-other similarities are considered noiseà

noise is filtered at stringency levels determined by experimenter

PROS AND CONS:

PROSà

-deals effectively with intrasequence rearrangements (a drawback of dynamic programming)

-for this reason should be used first for any sequence analysis (Collins and Coulson 1987)

CONSà

-difficult to align sequences optimally in a consistent quantifiable manner

-relies on experimenter to introduce gaps to achieve better alignment

DYNAMIC PROGRAMMING

-NEEDLMAN-WUNSCH (1970) provided first theoretical framework for global alignment (i.e. comparison of the entire sequence)

BASIC PREMISES OF DP:

-align two sequences allowing for all possible gaps (i.e. deletions in the protein or nucleotide sequence)

-optimal alignment is determined by evaluating each unit (amino acid or nucleotide) in a 2 x 2 matrix

DYNAMIC PROGRAMMING

"the best alignment that ends at a given pair of bases or residues is the best alignment of the sequences up to that point, plus the score for aligning the two additional bases" -States and Boguski

example:

when comparing two sequences A + B

A à AATGC

B à AG--GC

The optimal match is determined by the score of the precursor match plus the score of the two additional units:

AATG C NEW

+ OPTIMAL

AG--G C MATCH

DYNAMIC PROGRAMMING

-brute force calculation of sequence comparison with insertion of GAPS would be:

N² operations done 2N times =

N^4N

-meaning that a comparison of two sequences with 300 residues would take

10⁸⁸ operations

(IMPOSSIBLE!!!!)

- DP by utilizing optimal alignment requires only N² operations, good for computational efficiency

DYNAMIC PROGRAMMING

Example of sequence comparison:

à proteins

TEXASRANGERSARETHEGREATESTBASEBALLTEAM

VS.

REDSKINSARETHEGREATESTFOOTBALLTEAM

Step 1) Set up 2 x 2 Identity Matrix and assign values of 1 to matches

DYNAMIC PROGRAMMING

Step 2) Compute the Sum Matrix for the alignment with the algorithm:

New_value_cell <= cell (R,C) +

Max { Cell (R+1, C+1),

Cells (R+1, C+2 to C_max),

Cells (R+2 to R_max, C+1) }

WHAT DOES THIS MEAN??

1^ST ) TAKE VALUE IN IDENTITY MATRIX (0 OR 1)

2^ND ) ADD THIS VALUE TO THE MAXIMUM VALUE LOCATED IN THE ROW OR COLUMN DIAGONAL DOWN FROM THE CELL IN THE SUM MATRIX

DYNAMIC PROGRAMMING

Start Chugging Away!!!!

	T	E	X	A	S	R	A	N	G	E	R	S
R						1					1
E		1								1
D
S					1							1
K
I
N								2	1	1	1	0
S					1			1	1	1	1	1
A				1			1	1	1	1	0	0
R						1		1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

DYNAMIC PROGRAMMING

Step 3: FINISH MATRIX

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

à Number in top left corner is Alignment score

DYNAMIC PROGRAMMING

Step 4) Traceback the optimum match

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

Alignment scheme from traceback:

T E X A S R A N G E R S

R E D S K I -- N S A R E

-Alternate tracebacks are possible

(but not with this example, see worksheet)

DYNAMIC PROGRAMMING

Variations of previous model:

-Analysis can become more complex with the utilization of different substitution matrices instead of identity matrix e.g.

Protein Similarity Matrix, value in matrix determined by amino acid characteristics like hydrophobicity, charge, size, etc. . .

-Gap penalties are introduced to mitigate the negative alignment effects of breaking a sequence

BOTH FACTORS AFFECT THE FINAL ALIGNMENT SCORE

LOCAL VS. GLOBAL ALIGNMENT

global alignment seeks best overall alignment

-local alignment seeks best alignment regardless of length

Different results for these techniquesà

Local- ignores overall sequence similarity highlighting common substructure/subfunction

Global- sacrifices local scores to ensure the best overall sequence alignment (in DP example, algorithm is required to end in the top left box)

HASHING

-sometimes N² operations is still too slow for large sequence comparisons

-HASHING is method used to increase speed of sequence alignment (used in conventional programs e.g. FASTA, BLAST, BLAZE)

basically involves breakdown of original sequence to facilitate comparison between the two

	T	E	X	A	S	R	A	N	G	E	R	S
R						1					1
E		1								1
D
S					1							1
K
I
N								2	1	1	1	0
S					1			1	1	1	1	1
A				1			1	1	1	1	0	0
R						1		1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R						1					1
E		1								1
D
S					1							1
K
I
N								2	1	1	1	0
S					1			1	1	1	1	1
A				1			1	1	1	1	0	0
R						1		1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R						1					1
E		1								1
D
S					1							1
K
I
N								2	1	1	1	0
S					1			1	1	1	1	1
A				1			1	1	1	1	0	0
R						1		1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0

	T	E	X	A	S	R	A	N	G	E	R	S
R	4	3	3	3	2	3	2	2	2	1	2	0
E	3	4	3	3	2	2	2	1	1	2	1	0
D	3	3	3	3	2	2	2	1	1	1	1	0
S	3	3	3	3	3	2	2	1	1	1	1	1
K	3	3	3	3	2	2	2	1	1	1	1	0
I	3	3	3	3	2	2	2	1	1	1	1	0
N	3	3	3	3	2	2	1	2	1	1	1	0
S	3	3	3	2	3	2	1	1	1	1	0	1
A	2	2	2	3	2	1	2	1	1	1	0	0
R	1	1	1	1	1	2	1	1	1	0	1	0
E	0	1	0	0	0	0	0	0	0	1	0	0