MB&B 452a Bioinformatics Final Project

Hei B. Kwok (Benjamin)

ALIGNMENT METHODS

With the exponential growth of the genome database, sequence alignment programs have become more and more essential in modern molecular biology. Sequence alignment enables one to find differences as well as similarities between two sequences. It can also help to group genes or proteins according to their homology or orthology. With proper analyses, one can also build phylogenetic trees and reveal the evolutionary progression of a particular gene.

The fundamental basis of all alignment programs relies on the dot matrix. which is used to find identity between two sequences. If one uses a dot to represent identity, one will obtain a dot plot by placing one sequence horizontally and the other vertically. If we score identity as one, the summation of all the identities in a particular alignment will be the score for that alignment. The optimal alignment will be the one with the highest score that will be the longest diagonal in a dot plot. To achieve the highest score, it may be necessary to introduce gaps in the aligned sequences. Many alignment programs score alignments with a gap penalty. Gap penalties include gap-opening penalty and gap-extension penalty. In mathematical terms, a gap penalty can be represented by Gap Penalty = a + bN where a = cost to open a gap, b = cost to extend a gap, and N = length of the gap. One can favor local alignment by raising the cost of gap opening and lowering the cost of gap extension. A method developed by Needleman and Wunsch uses the principles of dot matrix and dynamic programming to favor global similarity between sequences while the Smith-Waterman alignment method emphasizes subsequences within the two sequences that can achieve the maximum score independent of their position and the rest of the sequences. Different methods can serve different purposes. Local alignment programs are useful to search motifs or common domains while global comparisons are useful for finding proteins in the same family.

Pairwise alignment may sometimes not be sufficient. As the database grow exponentially, multiple sequence alignment or search for similar DNA / protein sequence in the database are more important and informative to researchers. The development of FASTA and the BLAST programs were major advances in molecular biology. They use hash tables to accelerate the alignment process when you querying a sequence against a huge database. Once hash hits are identified, they are extended until the total scores are maximized. The highest score represents the highest similarity between the query and the available database. However, the significance of the score depends on statistics. The concept of the P-value is crucial to give a meaning to the raw score. As the database grows, the significance of a particular score decreases. Therefore, it is not the raw score that tells us the significance in similarity between two sequences, but rather, the P-value. A P-value of 0.01 means only 1% of the sequences can give a better score than that particular match. The extension process in the BLAST program takes up most of the time in a BLAST search. The improved version of the original BLAST, gapped BLAST, uses a shorter word size, but only extends on diagonals of two hash hits with a certain gap length. The elimination of single hash hit extensions make gapped BLAST run faster than its predecessor. There is always a trade off between speed and sensitivity. A corresponding adjustment in the parameters (word size, gap distance, etc.) was made to maintain sensitivity but increase speed of the search.

Use a search type of alignment program like BLAST can identify homologues in different species. However, the analysis to determine the most conserved region or domain has to rely on multiple sequence alignment programs. The basic problem of multiple sequence alignment is that no dynamic programming approach can be used. Most multiple alignments begin with pairwise alignment and build the alignment progressively starting with the most similar sequence pair and following branching order in phylogenetic tree. Multiple sequence alignment is a very important tool in phylogenetic analysis, protein structure prediction and family tree analysis. Clustal is one of the most popular multiple alignment programs. Multiple alignment programs like Clustal also enable one to find a conserved motif and identify the most important amino acid(s) in the motif. This is extremely helpful when one wishes to carry out mutagenesis experiment on a protein of interest.

With a better understanding of molecular biology, many different parameters are being incorporated into alignment programs to provide more powerful and sensitive comparison. The simplest matrix is the identity matrix. However, not all mutations or substitutions are equally likely to occur in biological systems. A parameter to make this distinction would be helpful to determine how likely for a protein to evolve from its ancestor (eg. transition versus transversion). Similarity or substitution matrices are more informative and useful than identity matrices. Taking the likelihood of a particular amino acid substitution and frequency of occurrence of a particular amino acid substitution into consideration, Dayhoff, M.O. et al. developed the PAM matrix to account for mutation probability. Since mutation probability increases with time, one can calculate the evolutionary distance between two proteins. Since evolutionary rate is not uniform over the whole of a protein sequence, Henikoff, S. et al. developed the BLOSUM matrices in 1992. This method uses blocks of sequence fragments instead of full protein sequences to find percent identity among different proteins. Proteins of certain level of identity are clustered to generate a matrix. For example, BLOSUM70 is derived at the 70% identity level. The introduction of PAM and BLOSUM series of matrices into sequence alignment programs makes aligned sequences more biologically relevant.

While sequence alignment is one-dimensional, structural alignment is three-dimensional. The basic principle of structural alignment is the same as sequence alignment. The application of similarity matrix and dynamic programming is the same, but the fundamental premise of dynamic programming is violated due to the fact that structural alignment is three-dimensional. In addition to aligning amino acids between the two sequences, their coordinates in a three-dimensional system (3 axes, rotation and translation = 6 parameters) are also taken in account. To achieve optimal alignment, the distance between two structures has to be minimized. The distance can be calculated in terms of root mean square, RMS. The ultimate goal of structural alignment is to minimize the RMS value upon superposition of two structures. Therefore, aligning an additional amino acid can change the position of previous alignment due to translation and rotation along three axes. This violates the basic principle of dynamic programming which states that once the best alignment at a given pair of position is achieved, the score for that particular alignment would not change by aligning an addition pair of residues at other positions. RMS superposition is commonly used in many structural alignment programs, but other methods such as comparison of distance matrices and structural hashing are also used.

Many advancements have been achieved in sequence alignment in the recent years. With the exponentially growing database, the development of faster alignment methods such as gapped BLAST and fast processor has kept up the pace. The corresponding structural alignment methods are not as robust as the one-dimensional sequence alignment due to additional parameters. In my opinion, if we can incorporate our knowledge about protein structure into structure alignment programs, it will make them run faster. There has been a tremendous amount of effort attempting to classify protein structures. Many fold libraries are generated using different methods such as Scop, Cath and FSSP. Proteins with the same major fold should be able to superimpose at certain orientation so that the common fold is aligned to give a minimal RMS value. If we orient each fold to a defined orientation and set it as a reference, the effort to orient proteins for alignment will be minimized. Even if the initial alignment is not optimal, only small adjustments are required to achieve the best alignment (for example -/+ 10 degrees in rotation instead of 360 degrees). Since proteins are composed of folds, one should be able to align structure according to folds. In addition, many well-characterized folds contain conserved domain or signature residues that can be used to perform structure hashing to find proteins in the same family. With all the efforts that have been and are being put in to solve the mystery of protein folding, we hope that one day we will be able to predict tertiary structure of a protein from its primary sequence.