Needleman Wunsch were among the first to develop a program to find the similarity between two protein sequences

Final Project

MBB 752a2

Jennifer Kavran

The Role of Gap Penalties in Searching for Distant Homologues

Large gapes in sequence alignments are often interpreted to mean the two sequences are so highly divergent they are unrelated. However, it has been shown that the loss of domains or large chunks of a sequence are common between distantly related proteins. As a result, conventional sequence alignments that implement linear gap penalties may not be able to efficiently find such distant homologues. It was thought the use of a logarithmic or exponential gap penalty could be appropriate for such searches. Until recently, such alignments were computationally infeasible. Mott et al recently published a paper describing a new sequence alignment program, monotone, which use monotonic non affine gap penalties to increase the sensitivity of sequence alignments and several optimization techniques to reduce the running time (1).

Sequence alignments were originally constructed with the aim of finding sequences with areas of significant homologies. The Needleman-Wunsch method finds the best global alignment by using a dot-matrix method (2). The Smith-Waterman method searches for regions of local sequence similarity, referred to as high scoring pairs (HSP), and builds the sequences are those pairs. This method allows for gaps by using an affine gap penalty (W), W=A+BN; where A is the cost of opening a gap, B is the cost of every additional space in the gap, and N is the length of the gap (3). The commonly used program, BLAST and gapped-BLAST follow a Smith-Waterman approach of searching for HSPs (4).

Affine gap penalties are predicted to assign penalties to stringently to find the distantly related homologues, which are separated by large non-overlapping spans of sequence. Gu et al studied the length of protein gaps by analyzing the sequence similarities of 78 human pseudogenes (5). It was found that the length of the gap varied logarithmically with the frequency (F) of their occurrence. The relationship could be expressed as F=CN^-B; where C and B are proportional constants and N is the length of the gap. From this frequency observation, they suggested that the gap penalty in sequence alignments should become W=A+B ln N; where A and B are constants and N is the length of the gap. This gap penalty could avoid the exclusion of such distantly related proteins by allowing the penalty to grow more slowly for longer gaps.

Altschul et al designed a program that takes a different approach to finding distant homologues (6). Instead of slowing down the growth of the gap penalty for large unpaired regions, instead the gap penalty would assign different penalties to unmatched and mismatched residues. The theory is that this gap penalty would simply ignore regions of low complexity and highlight conserved sequences. This three parameter gap penalty follows the form W=A + B(N₁-N₂) + CN₂; where A is the value for starting a gap, B for each residue inserted or deleted, C for each pair left unaligned, N₁ and N₂ the length of the two sequences being aligned. The paper fails to produce any clear guidelines for determining the value of the three parameters. From the examples given, the parameters seemed to be arrived at by a series of trials. Differentiating between mismatches and unaligned residues already occurs in sense by other methods such as PAM. Furthermore, the program fails to adequately address the fact that this gap penalty could still become very large for long unpaired segments unless the group gives better definitions for their three parameters.

Mott et al describes a program, monotone, that uses a non-linear gap penalty, a Smith-Waterman dot matrix method, and several optimization techniques which make the running of the program feasible. The program is set-up to accept four different gap penalties: (1) a user defined method, (2) a logarithmic penalty following the form W=A+BlogN, (3) an affine penalty following the form W=A+BN, and (4) a power law penalty following the form W=A+BN^C. Each of these gap penalties are monotonically increasing so that W(k)³ W(k-1). From this paper it is not clear when to each of these gap penalties are appropriate to use.

The computational time of this program varies, for average complexity cases, between O(mn) and O(mn(m+n)). Random alignment running times, a worst-case scenario, were not shown and the authors admit that the running times would be very long. The program runs efficiently because of several optimization techniques that are employed. In general, the program uses a candidate list to align the sequences but the list remains bounded by the use of stringent penalties. Stringent gap penalties refers to the a factor which assigns the sensitivity to the search and how distantly related the sequences can be. Mathematically it is defined as sensitivity assessment that follows the extreme value distribution.

The program was tested in three different runs. First, the results of a simple alignment of two related sequences were run using a standard Smith-Waterman method and monotone. Both programs gave similar results. Second, a series of runs were performed with sequences of increasing length and divergence. These runs were performed with and without the optimization. It was shown that the optimization techniques gave reasonable result times and seem to be quadratic in behavior. Also, the results of these searches followed the expected extreme value distribution, Pr(s³ r) = e^{-KMNe^(-l
s)}.. Third, two sequences were generated in which each contained to two domains that were highly homologues to the domains in the alternate sequence but were separate by a long stretch of random residues. A program using a standard affine gap penalty was unable to detect the match. However, monotone with a logarithmic gap penalty was able to find the match.

From these results it seems that the for the average case scenarios presented that monotone is able to align sequences with long gaps separating the regions of identity and that the optimization and set-up of this program allows it to be run in an acceptable time-scale. Additionally, this program is more equipped at finding those partners than the standard sequence alignment programs. This program therefore presents what could be the ground work for further programs aimed at finding long separated orthlogues, which in the advent of genomics, could prove to be a very necessary and rewarding technique.

References

1.Mott, Richard. (1999) Local sequence alignments with monotonic gap penalties. Bioinformatics, 15(6):455-462.

2. Needleman, SB & Wunsch, CD. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443-453.

3. Smith, TF & Watermam, MS. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147:195-197.

4. Altschul, S.F., Madden, TL, Schaffer, AA, Zhang, J, Zhang, Z, Miller, W & Lipman, DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25(17):3389-3402.

5. Gu, Z & Li, W. (1995) The size distribution of insertions and deletions in human and rodent pseudogense suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol., 40:464-473.

6. Altscul, SF. (1998) Generalize affine gap costs for protein sequence alignment. Proteins, 32:88-96.