MB&B 452a Bioinformatics Final Project

Shaheen Karim


The PAM and Risler Matrices:
Two Different Approaches Toward Exchange Weight Determination

Techniques for aligning protein sequences are among the most important tools in bioinformatics. Algorithms used to align sequences typically require a “scoring matrix” to discriminate between the competing alignments that are possible. This matrix represents a scoring scheme that quantifies the relative “exchangeabilities” of all possible pairs (400) of the twenty amino acids, expressed as weighted probabilities. Dayhoff et al., in their widely used PAM matrices, proposed a scoring scheme based on the frequency of observed mutations in evolutionarily related proteins[i]. Risler et al. proposed a method whereby the three-dimensional superposition of the backbone topologies of two different but related proteins is used to generate a set of amino acid exchanges that can be statistically analyzed[ii]. As closer examination will show, these two methods exemplify the variety of theoretical approaches that can be used to discriminate between competing alignments.

Dayhoff developed a scoring matrix derived from the observed frequencies of evolutionarily accepted point mutations in closely related proteins. The methodology, calculations and definitions are summarized as followsⁱ. An accepted point mutation is defined as a residue exchange that has been evolutionarily accepted and adopted by the species as its predominant form. A total of 1572 accepted mutations were observed in 71 phylogenetic trees, comprising 34 protein superfamilies. Each mutation observed was entered into the accepted point mutation matrix. An element A_ij of this matrix represents the number of exchanges observed in all phylogenetic trees between the amino acid in column j and the amino acid in row i. In the second step, the relative mutability of each amino acid was calculated, because amino acids are not equally mutable: some residues are observed to mutate more frequently than others per occurrence. This is taken into account by defining the relative mutability of amino acid j as the number of times amino acid j mutated divided by the number of occurrences of amino acid j. The relative mutabilities and the accepted point mutation matrix are then used to calculate the mutation probability matrix. An element M_ij of this matrix represents, for a given evolutionary distance in PAM units, the probability that the residue in column j will be replaced by the residue in row i. One PAM unit is the evolutionary distance over which there has been 1 accepted mutation per 100 residues (links) on the path between two proteins. The 1 PAM matrix may be raised to the Nth power (that is, multiplied by itself repeatedly) to obtain the substitution matrix expected after N PAM units. Hence, the data may be extrapolated to align distantly related proteins. The elements of the mutation probability matrix are defined as follows:

A_ij = element of the accepted point mutation matrix

λ = proportionality constant

m_j = relative mutability of amino acid j

Off-diagonal elements (i ≠ j):

$$M_{ij} = \frac{\lambda \, m_j \, A_{ij}}{\sum_{i} A_{ij}}$$

Diagonal elements:

$$M_{jj} = 1 - \lambda \, m_j$$
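The construction above can be illustrated with a minimal Python sketch. It is not Dayhoff's actual computation: the three-letter alphabet, the counts in `A`, and the `occurrences` values are invented toy data, and all function names are hypothetical. The helper `pam1_lambda` assumes that the proportionality constant (written `lam` here) is chosen so that, on average, 1 residue in 100 changes, which is how the PAM unit is defined in the text.

```python
# Toy sketch of the mutation probability matrix; all data below are made up.

def relative_mutabilities(A, occurrences):
    """m_j = (times amino acid j mutated) / (occurrences of amino acid j)."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n) if i != j) for j in range(n)]
    return [col_sums[j] / occurrences[j] for j in range(n)], col_sums

def pam1_lambda(A, occurrences):
    """Assumption: scale so that the expected fraction of changed residues is 1%."""
    m, _ = relative_mutabilities(A, occurrences)
    total = sum(occurrences)
    freqs = [o / total for o in occurrences]
    return 0.01 / sum(f * mj for f, mj in zip(freqs, m))

def mutation_probability_matrix(A, occurrences, lam):
    """M[i][j]: probability that the residue in column j is replaced by row i."""
    n = len(A)
    m, col_sums = relative_mutabilities(A, occurrences)  # col_sums must be nonzero
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                M[i][j] = lam * m[j] * A[i][j] / col_sums[j]
        M[j][j] = 1.0 - lam * m[j]
    return M

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(M1, n_units):
    """Raise the 1 PAM matrix to the power n_units by repeated multiplication."""
    n = len(M1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_units):
        result = mat_mul(M1, result)
    return result

# Symmetric toy counts of accepted exchanges for a 3-letter alphabet.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
occurrences = [1000, 2000, 1500]

lam = pam1_lambda(A, occurrences)
M1 = mutation_probability_matrix(A, occurrences, lam)
M250 = pam_n(M1, 250)
```

Each column of `M1` sums to 1, because the off-diagonal entries in column j add up to `lam * m[j]` and the diagonal entry supplies the remainder. Repeated multiplication preserves this property, so `M250` is again a matrix of probabilities.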

In a final step, the relatedness log odds matrix with elements R_ij is then calculated as follows:

$$R_{ij} = \log\left(\frac{M_{ij}}{f_i}\right)$$

where f_i is the number of observations of amino acid i divided by the total number of amino acids observed. The antilog of the element R_ij represents the frequency with which residues i and j are observed to exchange, relative to the frequency that chance alone would predict. For example, with base-10 logarithms, a score of 0.1 indicates that the replacement is observed about 1.25 times more often than would be expected by chance. Similarly, a score of –0.3 indicates that the exchange is observed only about half as often as would be expected by chance. Thus, when performing alignments, this comparison matrix may be used to assign a similarity weight to every possible amino acid substitution pair.
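The arithmetic behind these two examples can be checked with a short Python sketch. It assumes base-10 logarithms, which the two quoted values imply; the function names and the 2 x 2 toy matrix are hypothetical and serve only as illustration.

```python
import math

def log_odds(M, freqs):
    """R[i][j] = log10(M[i][j] / f_i) for a mutation probability matrix M."""
    n = len(M)
    return [[math.log10(M[i][j] / freqs[i]) for j in range(n)] for i in range(n)]

def odds_ratio(score):
    """Antilog of a log-odds score: observed frequency relative to chance."""
    return 10 ** score

ratio_pos = odds_ratio(0.1)    # roughly 1.26
ratio_neg = odds_ratio(-0.3)   # roughly 0.50

# Toy two-letter example (invented numbers); columns of M sum to 1.
M = [[0.9, 0.2],
     [0.1, 0.8]]
freqs = [2 / 3, 1 / 3]
R = log_odds(M, freqs)
```

In this toy example the frequencies were chosen so that f_j * M_ij equals f_i * M_ji, and the resulting log-odds matrix is therefore symmetric: `R[0][1]` and `R[1][0]` are both log10(0.3).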

The approach of Risler et al. is very different from that of Dayhoff. Risler first superimposed in three dimensions the Cα backbones of the two protein structures to be examined. Amino acid pairs were considered only if they were no more than 1.2 angstroms apart and lay within a stretch of three consecutive close (≤ 1.2 angstroms) pairs. This second restraint was imposed to ensure that amino acid pairs in structurally unrelated segments that superimposed by chance were not included in the calculation. Using this method, 2860 amino acid exchanges were collected from the alignment of 32 structures. The observed acceptable substitution pairs were entered into a 20 x 20 matrix, where an element A_ij of this matrix represents the number of exchanges observed between amino acids i and j. Classical χ² analysis was then used to derive a structural superposition distance matrix[iii]. A scoring matrix appropriate for standard alignment programs can be deduced from this matrix.
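The pair-selection rule can be sketched in Python as follows. This is an illustrative reading of the criterion, not Risler's program: it assumes the superposition has already been done and that each pair arrives with its Cα distance, it interprets "a stretch of three consecutive close pairs" as a run of at least three, and it tallies identical pairs on the diagonal. The names `select_exchanges` and `count_exchanges` and the sample data are hypothetical.

```python
from collections import Counter

def select_exchanges(pairs, cutoff=1.2, min_run=3):
    """Keep pairs that are close and sit in a run of >= min_run close pairs.

    pairs: list of (residue_a, residue_b, distance_in_angstroms), ordered
    along the superposed backbones.
    """
    selected, run = [], []
    for pair in pairs:
        if pair[2] <= cutoff:
            run.append(pair)
        else:
            if len(run) >= min_run:
                selected.extend(run)
            run = []
    if len(run) >= min_run:  # flush a run that reaches the end of the chain
        selected.extend(run)
    return selected

def count_exchanges(selected):
    """Tally unordered residue pairs, i.e. a symmetric exchange count matrix."""
    return Counter(tuple(sorted((a, b))) for a, b, _ in selected)

# Invented example: one run of three close pairs, then a run of only two.
pairs = [("A", "A", 0.5), ("L", "I", 0.9), ("K", "R", 1.1),
         ("G", "P", 3.0),
         ("S", "T", 0.8), ("D", "E", 1.0),
         ("F", "Y", 2.5)]

selected = select_exchanges(pairs)
counts = count_exchanges(selected)
```

Here the S/T and D/E pairs are individually within the cutoff but are discarded because they form a run of only two, which is the kind of chance superposition the second restraint is meant to exclude.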

There are similarities and differences between these approaches that deserve mention. Both are based on actual observations rather than on the purely theoretical parameters used by others, such as the physico-chemical properties of the residue side chains[iv],[v],[vi] or the minimum genetic distance between codons[vii]. This is advantageous in that phenomena not predicted from purely a priori considerations may nevertheless be captured by a method built on objective observations. It is important to mention, however, that unlike Risler’s matrix, the PAM series of scoring matrices was calculated by extrapolating data from highly similar proteins (>85% similarity) in order to detect distantly related proteins. An approach that more explicitly represents the substitutions observed at farther evolutionary distances would certainly offer advantages. Risler also argues that the logic of the PAM matrices is circularⁱⁱ: the method used to initially align phylogenetically related proteins in order to observe accepted mutations introduces an inherent bias into the resulting PAM matrices. The PAM matrices nevertheless perform well and are often considered to be the standard set of matrices against which the performance of all other matrices is assessedⁱⁱ. For both the Risler and Dayhoff matrices, however, the use of more comprehensive and up-to-date data sets would improve performance.

The matrices of Dayhoff and Risler are thus representative of the variety of techniques used to measure similarity between protein sequences. Approaches to this class of matrix utilize different types of data, such as genetic, structural, and physico-chemical data sets, to determine sets of weighted probabilities for each possible amino acid exchange. Because matrices are derived from different data sets and calculated using various methodologies, each matrix has its own inherent strengths and weaknesses. Through understanding the fundamental underpinnings of methods such as those developed by Risler and Dayhoff, a better appreciation of their complexity, as well as insight into possible directions of future exploration, may be gained.