Presentation for Bioinformatics:

 

 

 

ALIGNMENT METHODS

FOR DUMMIES

(or undergraduates with little science background)

 

 

PART I:

SEQUENCE ALIGNMENT

BY ED PILLSBURY

 

 

PART II:

STRUCTURAL ALIGNMENT

BY RANJIT BINDRA

 

 

 

 

WHY IS THIS IMPORTANT?

 

SEQUENCE HOMOLOGY:

-nucleotide comparison

-protein sequence comparison

 

 

 

CHARACTERIZATION OF PROTEINS

 

DETERMINATION OF FUNCTION

 

EVOLUTIONARY ORIGIN

 

AND MORE!!!!

 

 

 

DOT MATRIX METHOD:

 

-first proposed by Gibbs and McIntyre (1970) for sequence alignment

 

-Advantages include identification of intrasequence rearrangement

 

-set up 2 x 2 matrix of protein sequences, e.g.

 

dotmatrixmethod vs. matrixdotmethod

 

 

 

 

 

 

 

 

 

D

O

T

M

A

T

R

I

X

M

E

T

H

O

D

M

     

X

         

X

         

A

       

X

                   

T

   

X

   

X

         

X

     

R

           

X

               

I

             

X

             

X

               

X

           

D

X

                         

X

O

 

X

                     

X

 

T

   

X

   

X

         

X

     

M

     

X

         

X

         

E

                   

X

       

T

   

X

               

X

     

H

                       

X

   

O

 

X

                     

X

 

D

X

                         

X

 

-diagonals indicate homology

 

-gaps between diagonalsà

SHOW SEQUENCE DIPLACEMENT

 

-other similarities are considered noiseà

noise is filtered at stringency levels determined by experimenter

PROS AND CONS:

 

PROSà

 

-deals effectively with intrasequence rearrangements (a drawback of dynamic programming)

 

-for this reason should be used first for any sequence analysis (Collins and Coulson 1987)

 

CONSà

 

-difficult to align sequences optimally in a consistent quantifiable manner

 

-relies on experimenter to introduce gaps to achieve better alignment

 

 

DYNAMIC PROGRAMMING

 

 

 

 

-NEEDLMAN-WUNSCH (1970) provided first theoretical framework for global alignment (i.e. comparison of the entire sequence)

 

BASIC PREMISES OF DP:

-align two sequences allowing for all possible gaps (i.e. deletions in the protein or nucleotide sequence)

 

-optimal alignment is determined by evaluating each unit (amino acid or nucleotide) in a 2 x 2 matrix

 

DYNAMIC PROGRAMMING

 

"the best alignment that ends at a given pair of bases or residues is the best alignment of the sequences up to that point, plus the score for aligning the two additional bases" -States and Boguski

 

example:

 

when comparing two sequences A + B

 

A à AATGC

B à AG--GC

 

The optimal match is determined by the score of the precursor match plus the score of the two additional units:

 

AATG C NEW

+ OPTIMAL

AG--G C MATCH

DYNAMIC PROGRAMMING

 

-brute force calculation of sequence comparison with insertion of GAPS would be:

N2 operations done 2N times =

N4N

-meaning that a comparison of two sequences with 300 residues would take

1088 operations

(IMPOSSIBLE!!!!)

 

 

- DP by utilizing optimal alignment requires only N2 operations, good for computational efficiency

 

DYNAMIC PROGRAMMING

 

Example of sequence comparison:

 

à proteins

 

TEXASRANGERSARETHEGREATESTBASEBALLTEAM

 

VS.

 

REDSKINSARETHEGREATESTFOOTBALLTEAM

 

Step 1) Set up 2 x 2 Identity Matrix and assign values of 1 to matches

 

T

E

X

A

S

R

A

N

G

E

R

S

R

1

1

E

1

1

D

S

1

1

K

I

N

1

S

1

1

A

1

1

R

1

1

E

1

1

 

 

 

 

DYNAMIC PROGRAMMING

 

 

Step 2) Compute the Sum Matrix for the alignment with the algorithm:

 

New_value_cell <= cell (R,C) +

Max { Cell (R+1, C+1),

Cells (R+1, C+2 to C_max),

Cells (R+2 to R_max, C+1) }

 

WHAT DOES THIS MEAN??

 

1ST ) TAKE VALUE IN IDENTITY MATRIX (0 OR 1)

 

2ND ) ADD THIS VALUE TO THE MAXIMUM VALUE LOCATED IN THE ROW OR COLUMN DIAGONAL DOWN FROM THE CELL IN THE SUM MATRIX

 

DYNAMIC PROGRAMMING

 

Start Chugging Away!!!!

 

 

T

E

X

A

S

R

A

N

G

E

R

S

R

1

1

E

1

1

D

S

1

1

K

I

N

2

1

1

1

0

S

1

1

1

1

1

1

A

1

1

1

1

1

0

0

R

1

1

1

0

1

0

E

0

1

0

0

0

0

0

0

0

1

0

0

 

 

 

 

 

 

 

 

 

 

 

 

DYNAMIC PROGRAMMING

 

Step 3: FINISH MATRIX

 

 

T

E

X

A

S

R

A

N

G

E

R

S

R

4

3

3

3

2

3

2

2

2

1

2

0

E

3

4

3

3

2

2

2

1

1

2

1

0

D

3

3

3

3

2

2

2

1

1

1

1

0

S

3

3

3

3

3

2

2

1

1

1

1

1

K

3

3

3

3

2

2

2

1

1

1

1

0

I

3

3

3

3

2

2

2

1

1

1

1

0

N

3

3

3

3

2

2

1

2

1

1

1

0

S

3

3

3

2

3

2

1

1

1

1

0

1

A

2

2

2

3

2

1

2

1

1

1

0

0

R

1

1

1

1

1

2

1

1

1

0

1

0

E

0

1

0

0

0

0

0

0

0

1

0

0

 

 

 

à Number in top left corner is Alignment score

 

 

 

 

 

 

DYNAMIC PROGRAMMING

 

Step 4) Traceback the optimum match

 

 

 

 

T

E

X

A

S

R

A

N

G

E

R

S

R

4

3

3

3

2

3

2

2

2

1

2

0

E

3

4

3

3

2

2

2

1

1

2

1

0

D

3

3

3

3

2

2

2

1

1

1

1

0

S

3

3

3

3

3

2

2

1

1

1

1

1

K

3

3

3

3

2

2

2

1

1

1

1

0

I

3

3

3

3

2

2

2

1

1

1

1

0

N

3

3

3

3

2

2

1

2

1

1

1

0

S

3

3

3

2

3

2

1

1

1

1

0

1

A

2

2

2

3

2

1

2

1

1

1

0

0

R

1

1

1

1

1

2

1

1

1

0

1

0

E

0

1

0

0

0

0

0

0

0

1

0

0

 

 

 

Alignment scheme from traceback:

 

T E X A S R A N G E R S

R E D S K I -- N S A R E

 

-Alternate tracebacks are possible

(but not with this example, see worksheet)

DYNAMIC PROGRAMMING

 

Variations of previous model:

 

-Analysis can become more complex with the utilization of different substitution matrices instead of identity matrix e.g.

Protein Similarity Matrix, value in matrix determined by amino acid characteristics like hydrophobicity, charge, size, etc. . .

 

-Gap penalties are introduced to mitigate the negative alignment effects of breaking a sequence

 

BOTH FACTORS AFFECT THE FINAL ALIGNMENT SCORE

 

LOCAL VS. GLOBAL ALIGNMENT

 

 

-local alignment seeks best alignment regardless of length

 

Different results for these techniquesà

 

Local- ignores overall sequence similarity highlighting common substructure/subfunction

 

Global- sacrifices local scores to ensure the best overall sequence alignment (in DP example, algorithm is required to end in the top left box)

 

HASHING

 

 

-sometimes N2 operations is still too slow for large sequence comparisons

 

-HASHING is method used to increase speed of sequence alignment (used in conventional programs e.g. FASTA, BLAST, BLAZE)

 

basically involves breakdown of original sequence to facilitate comparison between the two