Shaheen Karim
MB&B 452a
The PAM and Risler Matrices:
Two Different Approaches Toward Exchange Weight Determination
Techniques involving the
alignments of protein sequences represent some of the most important tools in
bioinformatics. Algorithms used to
align sequences typically require a “scoring matrix” to differentiate between
competing alignments that are possible.
This matrix represents a scoring scheme that quantifies the relative
“exchangeabilities” of all possible pairs (400) of the twenty amino acids,
expressed as weighted probabilities.
Dayhoff et al., in their widely
used PAM matrices, proposed a scoring scheme based on the frequency of observed
mutations in evolutionarily related proteins[i]. Risler et al. proposed a method whereby the three-dimensional
superposition of the backbone topologies of two different but related proteins
is used to generate a set of amino acid exchanges that can be statistically
analyzed[ii]. As closer examination
will show, these two methods exemplify the variety of theoretical
approaches that can be used to differentiate between competing alignments.
Dayhoff
developed a scoring matrix that is derived from the observed frequencies of
evolutionarily accepted point mutations in closely related proteins. The
methodology, calculations and definitions are summarized as follows[i]. An accepted point mutation is defined as a
residue exchange evolutionarily accepted and adopted by the species as its
predominant form. A total of 1572
accepted mutations were observed in 71 phylogenetic trees, comprising 34
protein superfamilies. Each mutation
observed was entered into the accepted point mutation matrix. An element Aij of this matrix
represents the number of exchanges observed in all phylogenetic trees between
the amino acid in column j and the amino acid in row i. In the second step,
the relative mutability of the individual amino acids was calculated because
amino acids are not equally mutable.
That is to say, some residues are observed to mutate more frequently
than others per occurrence. This is taken
into consideration by defining the relative mutability of amino acid j as the
number of times amino acid j mutated divided by the number of occurrences of
amino acid j. Data from the relative
mutabilities of the amino acids and the accepted point mutation matrix are then
used to calculate the mutation probability matrix. Elements of the matrix represent, for a given evolutionary
distance in PAM units, the probability that the residue in column j will be
replaced by the residue in row i. A PAM evolutionary unit represents the
time span in which there has been 1 accepted mutation per 100 linkages on the
path between two proteins. The 1 PAM
matrix may be raised to the Nth power to obtain the substitution matrix
expected after N PAM units. Hence, data
may be extrapolated to align distantly related proteins. The elements of the mutation probability
matrix are defined as follows:
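$A_{ij}$ = element of the accepted point mutation matrix
$\lambda$ = proportionality constant
$m_j$ = relative mutability of amino acid j
$$M_{ij} = \frac{\lambda \, m_j \, A_{ij}}{\sum_{i} A_{ij}} \quad (i \neq j, \text{ non-diagonal elements})$$
$$M_{jj} = 1 - \lambda \, m_j \quad (\text{diagonal elements})$$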
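The construction above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Dayhoff's actual computation: the three-letter alphabet and all counts are hypothetical, the function names are invented for this example, the diagonal of the count matrix is taken to be zero, and the proportionality constant (lam in the code) is fixed by assuming that 1 PAM should leave 99% of residues unchanged on average.

```python
def relative_mutabilities(A, occurrences):
    """m_j = (times residue j was seen to mutate) / (occurrences of residue j)."""
    n = len(A)
    return [sum(A[i][j] for i in range(n)) / occurrences[j] for j in range(n)]


def mutation_probability_matrix(A, m, lam):
    """Build M, where M[i][j] is the probability that residue j becomes residue i."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Sum of A_ij over all i; the diagonal of A is assumed to be zero.
        column_total = sum(A[i][j] for i in range(n))
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * m[j]
            else:
                M[i][j] = lam * m[j] * A[i][j] / column_total
    return M


def mat_mul(X, Y):
    """Plain matrix product of two square matrices stored as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(M1, n):
    """Raise the 1 PAM matrix to the n-th power (n >= 1)."""
    result = M1
    for _ in range(n - 1):
        result = mat_mul(result, M1)
    return result


# Hypothetical toy data for a three-letter alphabet (not Dayhoff's counts).
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
occurrences = [400, 250, 600]

m = relative_mutabilities(A, occurrences)
total = sum(occurrences)
f = [count / total for count in occurrences]

# Assumed calibration: pick lam so that, on average, 1% of residues change.
lam = 0.01 / sum(fj * mj for fj, mj in zip(f, m))

PAM1 = mutation_probability_matrix(A, m, lam)
PAM250 = pam_n(PAM1, 250)
```

Each column of PAM1 sums to 1, since a residue either stays the same or is replaced by one of the others, and the same holds for any power of the matrix.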
In a final step, the relatedness log odds matrix
with elements Rij is then calculated as follows:
$$R_{ij} = \log_{10}\left(\frac{M_{ij}}{f_i}\right)$$
where fi is the number of
observations of amino acid i divided by the total number of amino acids observed. The antilog of the element Rij represents
the frequency with which residues i and j are expected to exchange, relative to
what random chance would predict. For
example, a score of 0.1 indicates that the replacement is observed about 1.25 times
as often as would be expected by chance.
Similarly, a score of -0.3 indicates that the residue exchange is
observed about half as often as would be expected by chance. Thus when performing alignments, this
comparison matrix may be used to assign similarity weights to every possible
amino acid substitution pair.
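As a rough check on the arithmetic above, the log-odds step can be written out as follows. This is a sketch only: the two-letter matrix M and the frequencies f are hypothetical values chosen for illustration, the function names are invented here, and base-10 logarithms are assumed because they reproduce the 1.25 and one-half figures quoted in the text.

```python
import math


def log_odds_matrix(M, f):
    """R[i][j] = log10(M[i][j] / f[i]) for a mutation probability matrix M."""
    n = len(M)
    return [[math.log10(M[i][j] / f[i]) for j in range(n)] for i in range(n)]


def odds_ratio(score):
    """Antilog of a score: observed exchange frequency relative to chance."""
    return 10 ** score


# Hypothetical two-letter example; each column of M sums to 1.
M = [[0.99, 0.02],
     [0.01, 0.98]]
f = [2 / 3, 1 / 3]

R = log_odds_matrix(M, f)

print(round(odds_ratio(0.1), 2))   # ratio for a score of 0.1
print(round(odds_ratio(-0.3), 2))  # ratio for a score of -0.3
```

Positive scores mark exchanges seen more often than chance would predict and negative scores mark exchanges seen less often, which is what lets an alignment program reward the former and penalize the latter.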
The
approach of Risler et al. is very
different from that of Dayhoff. Risler first superimposed in three dimensions
the Cα backbones of the two proteins to be
examined. Amino acid pairs that
were no more than 1.2 angstroms apart and within a stretch of three consecutive
close (≤ 1.2 angstroms) pairs were considered. This second restraint was imposed to ensure
that amino acid pairs in structurally unrelated segments that superimposed by
chance were not included in the calculation.
Utilizing this method, 2860 amino acid exchanges were considered from
the alignment of 32 structures. The
observed acceptable substitution pairs were entered into a 20 x 20 matrix,
where elements Aij of this matrix represent the number of exchanges
observed between amino acid i and j.
Classical χ²
analysis was then used to derive a structural superposition distance matrix[iii]. A scoring matrix appropriate for standard
alignment programs can be deduced from this matrix.
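The two selection criteria described above (a 1.2 angstrom cutoff and a stretch of three consecutive close pairs) can be illustrated with a short sketch. It is not Risler's program: the input format, the function names, and the residue pairs are hypothetical, "consecutive" is assumed to mean adjacent entries in the ordered list of superimposed pairs, and identical pairs are counted alongside true exchanges for simplicity.

```python
from collections import Counter

CUTOFF = 1.2   # angstroms, as stated in the text
MIN_RUN = 3    # required number of consecutive close pairs


def accepted_pairs(pairs, cutoff=CUTOFF, min_run=MIN_RUN):
    """Keep close pairs that sit inside a run of at least min_run close pairs.

    pairs: list of (residue_a, residue_b, distance) ordered along the
    superposition (a hypothetical input format).
    """
    kept, run = [], []
    for res_a, res_b, dist in pairs:
        if dist <= cutoff:
            run.append((res_a, res_b))
        else:
            if len(run) >= min_run:
                kept.extend(run)
            run = []
    if len(run) >= min_run:  # flush a run that reaches the end of the list
        kept.extend(run)
    return kept


def count_exchanges(kept):
    """Tally pairs symmetrically, so (L, I) and (I, L) share one entry."""
    counts = Counter()
    for a, b in kept:
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical superimposed pairs with their distances in angstroms.
pairs = [("A", "A", 0.5), ("L", "I", 0.9), ("K", "R", 1.1),
         ("G", "S", 2.5),
         ("D", "E", 0.8),   # close, but isolated: rejected
         ("V", "V", 3.0),
         ("T", "S", 1.0), ("F", "Y", 1.2), ("N", "D", 0.4)]

kept = accepted_pairs(pairs)
counts = count_exchanges(kept)
```

In this toy input the D/E pair is dropped even though it lies within the cutoff, because it is not part of a run of three close pairs; this is the chance superposition the second restraint is meant to exclude.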
There
are similarities and differences between these approaches that deserve
mention. Both approaches are
based on actual observations rather than on the purely theoretical parameters used
by others, such as the physico-chemical properties of the residue side chains[iv],[v],[vi] or the minimum genetic distance between codons[vii].
This is advantageous in that observed phenomena not predicted
from purely a priori considerations
are nevertheless captured by a method built on objective
observations. It is important to
mention, however, that unlike Risler’s matrix, the PAM series of scoring matrices
was calculated by extrapolating data from highly similar proteins (>85%
similarity) in order to detect distantly related proteins. An approach that more explicitly represents the substitutions
observed at farther evolutionary distances would certainly offer advantages.
Risler argues that the logic of the PAM matrices is circular[ii]: the method for initially
aligning phylogenetically related proteins in order to observe accepted
mutations introduces an inherent bias in the resulting PAM matrices. The PAM matrices nevertheless perform well
and are often considered to be the standard set of matrices against which the
performance of all other matrices is assessed[ii]. For both the Risler and Dayhoff matrices,
however, the use of data sets that are more comprehensive and updated would
improve performance.
The matrices of Dayhoff and Risler are thus representative of the variety of techniques used to measure similarity between protein sequences. Approaches to this class of matrix draw on different types of data, such as genetic, structural, and physico-chemical data sets, to determine sets of weighted probabilities for each possible amino acid exchange. Because the matrices are derived from different data sets and calculated using different methodologies, each has its own inherent strengths and weaknesses. Understanding the fundamental underpinnings of methods such as those developed by Risler and Dayhoff gives a better appreciation of their complexity, as well as insight into possible directions for future exploration.
[i] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352, National Biomedical Research Foundation, Washington, DC
[ii] Risler, J.L., Delorme, M.O., Delacroix, H. & Henaut, A. (1988) Amino acid substitutions in structurally related proteins. A pattern recognition approach. J. Mol. Biol. 204, 1019-1029
[iii] Foucart, T. (1982) In Analyse Factorielle. Programmation sur Micro ordinateurs, pp. 184-221, Masson, Paris
[iv] Miyata, T., Miyazawa, S. & Yasunaga, T. (1979) Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12, 219-236
[v] Rao, J.K. (1987) New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters. Int. J. Pept. Protein Res. 29, 276-281
[vi] Grantham, R. (1974) Amino acid difference formula to help explain protein evolution. Science 185
[vii] Fitch, W.M. (1966) An improved method of testing for evolutionary homology. J. Mol. Biol. 16, 9-16