Molecular Biophysics & Biochemistry 447b3 / 747b3Bioinformatics
Aligning Text Strings
Raw Data ??? T C A T G C A T T G
T C A T G . | | | . C A T T G
T C A - T G | | | | . C A T T G
T C A T - G | | | | . C A T T G
Dynamic Programming
- What to do for Bigger String?
SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGGREGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEP
KPNEPRGDILLPTVGHALAFIERLERPELYGVNPEVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRTEDFDGVWAS
- Needleman-Wunsch (1970) provided first automatic method
- Dynamic Programming to Find Global Alignment
- Their Test Data (J->Y)
- ABCNYRQCLCRPMAYCYNRCKCRBP
Step 1 -- Make a Dot Plot (Similarity Matrix)
Put 1's where characters are identical.
A More Interesting Dot Matrix
Step 2 -- Start Computing the Sum Matrix
cell(R,C) { Old value, either 1 or 0 }
cell (R+1, C+1), { Diagonally Down, no gaps }
cells(R+1, C+2 to C_max),{ Down a row, making col. gap }
cells(R+2 to R_max, C+2) { Down a col., making row gap }
Step 3 -- Keep Going
Step 4 -- Sum Matrix All Done
Alignment Score is 8 matches.
Step 5 -- Traceback
Find Best Score (8) and Trace BackA B C N Y - R Q C L C R - P MA Y C - Y N R - C K C R B P
Step 6 -- Alternate Tracebacks
A B C - N Y R Q C L C R - P MA Y C Y N - R - C K C R B P
Key Idea in Dynamic Programming
- The best alignment that ends at a given pair of positions (i and j) in the 2 sequences is the score of the best alignment previous to this position PLUS the score for aligning those two positions.
- An Example Below
- Aligning R to K does not affect alignment of previous N-terminal residues. Once this is done it is fixed. Then go on to align D to E.
- How could this be violated? Aligning R to K changes best alignment in box.
Computational Complexity
- Basic NW Algorithm is O(n2) (in speed)
- M x N squares to fill
- At each square need to look back (M’+N’) “black” squares to find max in block
- M x N x (M’+N’) -> O(n3)
- However, max values in block can be cached, so algorithm is really only O(n2)
- Improvements can (effectively) reduce sequence comparison to O(n) in both
Gap Penalties
The score at a position can also factor in a penalty for introducing gaps (i. e., not going from i, j to i- 1, j- 1).
Gap penalties are often of linear form:
GAP = a + bN
GAP is the gap penalty
a = cost of opening a gap
b = cost of extending the gap by one (affine)
N = length of the gap
(Here assume b=0, a=1/2, so GAP = 1/2 regardless of length.)
Step 2 -- Computing the Sum Matrix with Gaps
cell(R,C) { Old value, either 1 or 0 }
cell (R+1, C+1), { Diagonally Down, no gaps }
cells(R+1, C+2 to C_max) - GAP ,{ Down a row, making col. gap }
cells(R+2 to R_max, C+2) - GAP { Down a col., making row gap }
Similarity (Substitution) Matrix
- Identity Matrix
- Match L with L => 1Match L with D => 0Match L with V => 0??
- S(aa-1,aa-2)
- Match L with L => 1Match L with D => 0Match L with V => .5
A R N D C Q E G H I L K M F P S T W Y V
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
C 0 -3 -3 -3 8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
H -2 0 1 -1 -3 0 0 -2 7 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 6 -1 -1 -4 -3 -2
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 10 2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 6 -1
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
Where do matrices come from?
1 Manually align protein structures(or, more risky, sequences)
2 Look at frequency of a.a. substitutionsat structurally constant sites.
S(aa-1,aa-2) = log ( freq(O) / freq(E) )
O = observed exchanges, E = expected exchanges
+ —> More likely than random
- —> Less likely than random
The Score
S(i,j) = similarity matrix score for aligning i and j
Sum is carried out over all aligned i and j
n = number of gaps (assuming no gap ext. penalty)
Molecular Biology Information:Macromolecular Structure
- DNA/RNA/Protein
- Almost all protein
(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)
Molecular Biology Information: Protein Structure Details
- Statistics on Number of XYZ triplets
- 200 residues/domain -> 200 CA atoms, separated by 3.8 A
- Avg. Residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A
- => ~1500 xyz triplets (=8x200) per protein domain
- 10 K known domain, ~300 folds
ATOM 1 C ACE 0 9.401 30.166 60.595 1.00 49.88 1GKY 67
ATOM 2 O ACE 0 10.432 30.832 60.722 1.00 50.35 1GKY 68
ATOM 3 CH3 ACE 0 8.876 29.767 59.226 1.00 50.04 1GKY 69
ATOM 4 N SER 1 8.753 29.755 61.685 1.00 49.13 1GKY 70
ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71
ATOM 6 C SER 1 10.453 29.500 63.579 1.00 41.99 1GKY 72
ATOM 7 O SER 1 10.593 29.607 64.814 1.00 43.24 1GKY 73
ATOM 8 CB SER 1 8.052 30.189 63.974 1.00 53.00 1GKY 74
ATOM 9 OG SER 1 7.294 31.409 63.930 1.00 57.79 1GKY 75
ATOM 10 N ARG 2 11.360 28.819 62.827 1.00 36.48 1GKY 76
ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77
ATOM 12 C ARG 2 13.502 29.501 63.500 1.00 25.54 1GKY 78
ATOM 1444 CB LYS 186 13.836 22.263 57.567 1.00 55.06 1GKY1510
ATOM 1445 CG LYS 186 12.422 22.452 58.180 1.00 53.45 1GKY1511
ATOM 1446 CD LYS 186 11.531 21.198 58.185 1.00 49.88 1GKY1512
ATOM 1447 CE LYS 186 11.452 20.402 56.860 1.00 48.15 1GKY1513
ATOM 1448 NZ LYS 186 10.735 21.104 55.811 1.00 48.41 1GKY1514
ATOM 1449 OXT LYS 186 16.887 23.841 56.647 1.00 62.94 1GKY1515
TER 1450 LYS 186 1GKY1516
Sperm Whale Myoglobin
Structural Alignment of Two Globins
Immunoglobulin Alignment (Harder)
Some Similarities are Readily Apparent others are more Subtle
Very Subtle: G3P-dehydro-genase, C-term. domain
Automatic Structural Alignment
Generalized Similarity Matrix
- PAM(A,V) = 0.5
- Applies at every position
- S(aa @ i, aa @ J)
- Specific Matrix for each pair of residues i in protein 1 and J in protein 2
- Example is Y near N-term. matches any C-term. residue (Y at J=2)
- S(i,J)
- Doesn’t need to depend on a.a. identities at all!
- Just need to make up a score for matching residue i in protein 1 with residue J in protein 2
Similarity Matrixfor Structural Alignment
- Structural Alignment
- Similarity Matrix S(i,J) depends on the 3D coordinates of residues i and J
- Distance between CA of i and J
-
- M(i,j) = 100 / (5 + d2)
- Threading
- S(i,J) depends on the how well the amino acid at position i in protein 1 fits into the 3D structural environment at position J of protein 2