nearest-neighbor.html

Nearest Neighbor Methods

The nearest neighbor method of secondary structure prediction has also been called memory-based, exemplar-based, or the homologous method. The method is performed by finding some number of the closest sequences (from a database of proteins with known structure) to a subsquence defined by a window around the amino acid of interest. Using the known secondary structures of the aligned sequences (generally weighted by their similarity to the target sequence) a secondary structure prediction is made. Though the method is simple to describe, a number of undefined parameters in the above description allow the method to be applied in a wide range of ways, with a similarly wide range of results. Sequences are chosen based on their similarity, but in what fashion is similarity defined? What size window of sequence should be examined? How many close sequences should be selected? Two methods, both nearest neighbor in approach, but each using a different set of these parameters will be examined: the memory based expert module from Zhang, et al.'s hybrid method (Zhang, et al., 1992), and the SSPAL algorithm of Salomov and Solovyev (Salomov and Solovyev, 1997).

The nearest neighbor method relies on selecting the closest subsequences to a window around the amino acid which is being predicted. Of course, this can be done in a number of ways. Zhang, et al. use a distance measure reminiscent of the probabilities used in the information theory of the GOR method (above). The distance in this case is a bit unusual and is a measure of similarity in secondary structure propensity between each pair of aligned amino acids:

where Di is the distance between amino acids ai and bi, m is the set of possible secondary structure states, n is the set of window positions, q is the number of amino acids (i.e. 20), and xhk indicates amino acid h at position k. Note that this distance measure is independent of the contents of the window other than the two amino acids being examined. In fact, for a given training set, only one distance matrix Dp,l is generated for use with all the alignments. Using this distance measure all possible alignments with the window are generated (the authors use a window size of 13 amino acids). The top 25 are selected, and the secondary structure of the central amino acid is predicted to be that of the plurality of the 25 adjusted to be inversely proportional to the distance between the sequences. The memory based module is the most successful (though only narrowly) of the hybrid system's modules, but scores only 64.5% accuracy.

Salomov and Solovyev's SSPAL method is significantly more successful. In fact it is the most successful seconday structure prediction method (on single protein sequences without homologs of known structure) which I have located, with a Q3 of 71.2%. An ennumeration of the differences in methodology might prove insightful. The first difference is the use of alignments of varying size, which may contain gaps. The authors claim that this allows the algorithm to access information which normal nearest neighbor methods may sacrifice. One can imagine how this might be the case for Zhang, et al.'s method. Imagine an alignment of the sort:

aaa-bb aaaabb

where a will be taken to mean a residue with strong helical potential and b a strong beta potential. Without allowing for gapped alignment, the score of this alignment will drop, perhaps significantly.

A second difference in the alignment is the use of a new scoring scheme, which uses both sequence similarity akin to that used by Zhang et al., and a 3D-1D score, first proposed by Bowie et al. (1991). This scoring method creates a distance matrix from the frequencies with which each amino acid appears in a given, defined chemical environment, e.g. "buried and partially polar." When a potential alignment is considered, part of the score dervies from the propensity of the amino acid being aligned to be in the (known) environment of the sequence in the database. Use of this type of scoring scheme generates a 4% improvement in accuracy (Yi and Lander, 1993).

Third, the first step in the algorithm is the reduction of the database for a given alignment from ca. 120 to the 90 proteins with closest pairwise Chou Fasman parameters. This method was previously explored in a method which attained a Q3 of only 67.6%. Thus, this step likely unresponsible for about 4% of the improvement in accuracy.

Last, the authors of SSPAL use more alignments (50 vs. 25 for Zhang, et al.) to generate their prediction. This likely has little impact on the quality of their prediction, since use of 20 or 30 alignments both yield ~71% accuracy, still far ahead of the Zhang algorithm.

As noted above, this method outperforms information theory, likely due to its ability to access third and higher order interactions (rather than only pairwise) and outside the window used by GOR. Unfortunately, the authors fail to give us any idea of how well their high scoring alignments correlate with certainty about the alignment, so one is unable to estimate how well the SSPAL algorithm competes with GOR in this regard.

Next Previous