One of the powerful tools for sequence analysis that has been rapidly developing for the past 30 years is alignment

One of the powerful tools for sequence analysis that has been rapidly developing for the past 30 years is alignment. Properly approached molecular sequence data is a rich source of knowledge capable of teaching much about the structure, function and evolution of biological macromolecules. To efficiently realize this potential some understanding of the process of and theoretical basis for sequence comparison is needed as well as a variety of practical tools to access and manipulate the data (1).

To align two or more sequences means to depict them one above the other, allowing gaps introduction between certain nucleotides or amino acid residues. Such way of depicting the sequences is widely spread in molecular biology because it implies that common origin or function binds positions standing in one column. The simultaneous alignment of many nucleotide or amino acid sequences forms a common prelude to phylogenetic analysis of sequences, the prediction of secondary structure (DNA or protein), the detection of homology between newly sequenced genes and existing sequence families, the demonstration of homology in multigene families, and the finding of candidate primers for PCR (8). For such a widely required analysis it took a long time for practical methods to appear. Earlier the alignments were made by eye simply to arrange the data well. Sometimes it meant that important criteria were satisfied but not formulated properly. However alignments pleasant to a biologist’s eye didn’t seem to have any other value. Widespread expressions "aligned to maximize homology" or "gaps introduced to maximize homology" have often masked quite an awkward attempt to maximize the number of matches and to minimize the quantity of gaps allowed (inserted). The goal of such actions remained unclear and the notion of "homology" undefined.

One of the first mathematical questions arising in the regard of the above is of complexity of sequence alignment. A naïve approach is to look through all possible alignment variants and to further estimate their quality. However up until 1987 it was a standard practice to construct alignments manually. This is very tedious and error prone (8).

If scrutinizing the sequence alignments from the combinatorial point of view it becomes obvious that the number of alignments is huge and therefore their straightforward sorting is impossible.

To prove it the example of two sequences is considered below (9). Let a=a₁a₂….a_n and b=b₁b₂…b_m be two sequences of the length n and m respectively. While aligning the empty symbols [0] are inserted, so that new sequences have equal length L. New sequences a*=a₁*a₂*….a_n* (derived from a) and b*=b₁*b₂*…b_m* (derived from b) written one below the other make the alignment:

a₁*a₂*…a_L*,

b₁*b₂*…b_L*.

To illustrate the above let a=ATAAGC and b=AAAACG. One of the many opportunities to align them is to make a*=[0]ATAAGC[0] and b*=AAAA[0]CG. The alignment will look as follows:

a*=[0]ATAAG C[0]

b*= AAAAA[0]CG

b₁=A maybe regarded as an insertion in to the first sequence or as the removal from (deletion in) the second sequence. Further a₂=A matches b₂=A, but a₃=T and b₃=A make a mismatch (9).

Michael Waterman (3,9) considered the problem of the number of possible alignments of given sequences a and b. Since the match of [0] and [0] is not allowed (there is no sense in matching two deletions) max {n,m} <= L <= n+m.

Let ‘s designate f(i,j) as the total number of all possible alignments of sequence of i symbols with sequence of j symbols. Obviously there are only three variants for the alignment of the last column:

1) …a_n, 2)…a_n, 3)…[0]

…[0] …b_m …b_m,

where the first opportunity is insertion / deletion of a_n, the second – a_n and b_m "gluing" (both matches and mismatches are allowed), the third one – insertion / deletion of b_m. Therefore

f(n,m) = f(n-1,m) + f(n-1,m-1) + f(n,m-1).

Later H.T. Laquer proved the following theorem:
Let’s define f(n,m) as earlier. Then if n the following correlation is true f(n,n)@(1+2^1/2)²ⁿ⁺¹n^1/2, where c(n) @d(n) means that lim _n c(n)/d(n)=1.

Thus for any two sequences of the length 1000 there exist

f(1000,1000) @ (1+ 2^1/2)²⁰⁰¹1000^1/2= 10^767.4… alignments.

A lot of different tools emerged in attempts to perfect the alignment techniques. Among the newly generated approaches 4 basic can be defined:

Dot matrix plots

Optimal pairwise alignment using dynamic programming

Fast, approximate techniques for detecting local similarities

Multiple alignment.

All of them serve different purposes and are considered below.

Dot Matrix Methods.

Gibbs and McIntyre are usually credited with introducing dot matrix analysis for sequence composition.

Dot matrix methods have enjoyed a wide variety of applications including analysis of gene structure and genome organization, identification of local sequence features, detection of internal sequence repeats, RNA folding, and molecular evolution (1). In most simple form to compare two sequences A and B, sequence A is written out vertically with each letter (residue or base) representing a row, and sequence B is written out horizontally with each letter representing a column. Then, each letter (row) of sequence A is compared to each letter (column) of sequence B and a dot is placed at the intersection of each row and column where the letters are identical. One then visually inspects the matrix for meaningful patterns of dots that represent certain sequence features (1).

Self comparison is also the method of choice for detecting and characterizing internal repeats in proteins and nucleic acids.

The remaining dots constitute the background noise of the analysis and are simply a reflection of the amino acid composition of the sequences. For nucleic acid sequences, which are composed of only four letters, this composition – dependent background noise is an even more serious problem and various "filtering" techniques have been devised to improve the signal-to-noise ratio (1).

The Window Approach and Output Filtering

For a coding region, it is possible to increase the specificity of sequence comparison simply by first translating it into its corresponding amino acid sequence and then performing the analysis. This is the universal case, not just for dot matrix method. For non-coding sequences and for many proteins other filtering techniques are necessary. The most widely used procedure compares the sequences using overlapping, fixed-length "windows" and requires that the comparison achieve some minimal threshold value or score, summed over the window before qualifying as a match (1). Most programs provide a default window that is adequate for preliminary analysis. For coding sequences in particular there are some theoretical grounds on which to base a span length selection. For example an average exon size or an average secondary or supersecondary protein structural element size could be used. Also it is possible to choose the window size rationally based on the purpose of analysis. For example, gene promoters, short stem-loop structures or enzyme active sites require a small window. For RNA/DNA secondary structure one might choose the length of a stable duplex. A multiple of codon length (for example, 9 nucleotides) is useful for distinguishing coding sequences from introns. For sequences that contain internal repeats, using a multiple of the repeat unit length is often desirable to suppress background noise, although for the repeats that are polymorphic in length, using a window that is about half the size of the average repeat length is required for maximum sensitivity. In general, the larger window, the lower the background noise (1).

A more sensitive way to measure the local similarity of protein sequences uses a scoring system that takes into account the physical, chemical, structural and/or evolutionary similarity of the amino acid residues.

The next step after sequence comparison is estimation of statistical significance, which has most commonly been accomplished by Mote Carlo analysis. In this methods one or both sequences are randomized (shuffled or scrambled) and the analysis repeated a sufficient number of times to calculate a mean and a standard deviation for the distribution of the scores. First, a comparison of the two sequences is performed, and a distribution of local similarity scores and their mean are obtained. Then one or both of the sequences are "shuffled" (maintaining the same nucleotide or amino acid compositions) and the pairwise comparison is repeated to obtain a score distribution for the randomized sequences. Normally the procedure is carried out 100 or more times to stimulate the rare events.

A simple extension of this approach computes a new score, known as Z score, that scales raw scores based upon their distance from the mean of the randomized sequence score distribution:

Z=(x-m)/s,

where x is the local alignment score, m is the mean of random scores and s is the standard deviation of random scores (1).

A major advantage of using dot matrix method to visualize sequence relationship is that it allows to see the similarities almost right away. It can be used for detecting potential secondary structure in nucleic acids.

Many protein sequences contain internal homology or repeats. A sensitive dot matrix analysis is one of the best ways to look for internally repetitive protein sequences. This was done by Gibbs and McIntyre who compared the 2a-haptoglobin sequence with itself and showed that it was composed of two approximately 60-residue repeats. Dot matrix analysis of the TFIIIA sequence of Xenopus laevis was instrumental in the discovery of the zinc finger DNA binding motif.

Dot matrix analysis is a simple method, but it relies on the investigator to recognize patterns indicative of similarity and to add gaps to the sequences to achieve an alignment. However, it is difficult to be sure that optimal alignment is made by hand. Thus assessments of homology are almost always made on the basis of the alignments produced by dynamic programming approaches. In theoretical programming this problem is known as line editing.

Dynamic Programming Methods

Needleman and Wunch initially introduced this approach to molecular biologists. According to Waterman (9) the authors didn’t know that their algorithm is based on the method of dynamic programming proposed by Richard Bellman.

According to their approach the best alignment that ends at a given pair of bases or residues is the best alignment of the sequences up to that point, plus the score for aligning the two additional bases of residues. Mathematically this is depicted as follows:

for sequence 1 and 2 numbered 1 to i and 1 to j respectively

S_ij = s_ij+ maxS_kl

1<=k<=i

1<=l<=j

where S_ij is the score for the alignment ending at i in sequence 1 and j in sequence 2

s_ijis the score fir aligning i with j.

The main merit of the article is in precisely formulated criterion for the optimal alignment. Needleman – Wunch criterion is based on maximizing the "alignment weight" – the sum of the weights of inserted or deleted symbols, the weights of mismatches (the negative likelihood) and the weights of matches (the positive likelihood). The distance D (a,b) between the two sequences a and b is set by the alignment with the minimal weighted sum of substitutions, insertions and deletions. The advantage of such definition is in that it determines the structure of metric space in the sequence space. D (a.b) satisfies the distance axioms:

D (a,b)=0, if and only if a=b.

D (a,b)=D (b,a) (symmetric).

D (a,b) <= D(a,c) + D(c,b) dor any given c (inequity of the triangle).

The basis for using such metric system is that the distances between sequences are often used for creating evolutionary trees.

In the original article the smallest unit for comparison suggested was a pair of amino acids, one from each protein. The maximum match was defined as the largest number of amino acids of one protein that can be matched with those of another protein while allowing for all possible deletions (2). The maximum match can be determined by representing in a two-dimensional array, all possible pair combinations that can be constructed from the amino acid sequences of proteins, A and B being compared. If the amino acids are numbered from the N-terminal end, A_j is the jth amino acid of protein A and B_i is the ith amino acid of protein B. The A_j represents the columns and the B_i the rows of the two-dimensional array, MAT. Then the cell, MAT_ij, represents a pair of combination that contains A_j and B_i. Every possible comparison can now be represented by pathways through the array. An i or j can occur only once in a pathway because a particular amino acid cannot occupy than one position at one time (2). In the simplest method MATij is assigned a value, one, if A_j is the same kind of amino acid as B_i; if they are different amino acids MAT_ij is assigned the value zero. The sophistication of the comparison is increased if, instead of zero or one each cell value is made a function of the composition of the proteins, the genetic code triplets representing the amino acids, the neighboring cells in the array or any theory concerned with the significance of a pair of amino acids. The maximum–match pathway then, is that pathway for which the sum of the assigned cell values (less any penalty values) is the largest (2). The maximum-match pathway can be obtained by beginning at the terminals of the sequences (i=y, j=z) and proceeding toward the origins first by adding to the value of each cell possessing indices i=y-1 and/or j=z-1, the maximum value from among all the cells which lie on a pathway to it. The process is repeated and the procedure is called traceback. The gap penalty is introduced for matching the latter with blank.

Although dynamic programming alignment techniques are guaranteed to find an optimal alignment there may be other equally optimal alignments, and some of them maybe more biologically relevant (1). The problem of equivalent alignments is worse for nucleic acid sequences than for protein sequences because nucleic acids are usually aligned using a score system that gives all identical comparison the same score. Protein alignments are usually made based on a scoring system that gives a partial score for amino acid residues that are chemically or mutationally similar.

To produce sensible alignments gap penalties must be used for longer sequences. Gap penalties were originally applied either as a single penalty, regardless of the length of the gap or as a penalty for each gap character inserted into sequences. More recently, most programs have relied on gap penalties with both a length-dependent and a length-independent term.

w_x=g+lx

where w_x is a penalty for a gap of length x

g is the length-independent term (gap opening penalty)

l is the length-dependent term (gap extension penalty) (1)

Alignment programs vary in their treatment of gaps introduced at the end of the sequences. For sequences that are known to be homologous it makes sense to weight end-gaps.

Different derivatives of the method are useful. For example solving the problem of how to make a short sequence better suit the long one. This is often necessary while finding important regulatory sequences, such as bacterial promoters(9). Also one of the applications of the method is finding the similar patterns in non-homologous sequences. For instance there are extended similar fragments in DNA of some viruses and their host-organisms.

The primary drawback to dynamic programming methods is that they require a considerable amount of computation (1). This limits their usefulness for tasks such as database searching. This can be speed up by two approaches – hashing or subdivision of sequences.

Fast Methods

Hashing is accomplished by building a table of the locations of short subsequences, or words, in the query sequence. Segments from the database are then sequentially tested against this hash or neighborhood table. If a match is found it corresponds to a potential alignment and is saved. The FASTP and FASTA programs have been proven to be very useful implementations of this type of algorithm. In the case of FASTA the hash table is built to require exact matches between the query and target sequence over the full length of the hash entry (1). This size of the hash entry (word size) is adjustable as an input parameter k-tuple. The use of the larger k-tuple increases the speed of the algorithm dramatically, but the requirement for an exact match over the entire length of k-tuple means that many significant alignments between moderately diverged sequences will be missed if a large k-tuple is used (for proteins greater than two). The FASTA passes through four steps as it searches the database for the similar sequences.

Regions of diagonals with the highest density common words are found.

Ten regions with the higher density of words are rescored using the mutation data matrix. The ends of the regions are then trimmed to remove residues that do not contribute to the highest score, resulting in a set of partial alignments without gaps. The best scoring region is reported as "init1" score.

The regions that are near enogh together are joined into a single region. The score for this region "initn" includes the gap penalties needed to join initial regions is used to sort the output of the search.

The dynamic programming alignment is performed in a band 32 residues wide in a region around the best ("init") initial region. This is the "opt" or optimized score (1).

Search strategies based on complete or approximate alignment of sequences including the presence of insertions or deletions have been widely utilized. Then a very efficient neighborhood search algorithm (Basic Local Alignment Search Tool, or BLAST) has been devised that examines only ungapped segments , but ranks matches statistically. BLAST programs have been written to compare protein or DNA queries with protein or DNA databases in any combination, with DNA sequences often undergoing conceptual translation before any comparison is performed. BLAST permits a tradeoff between speed and sensitivity, with the setting of threshold parameter, T. A higher value of T yields greater speed, but also an increased probability of missing weak similarities. The original BLAST program seeks short word pairs whose aligned score is at least T (4,5).

Two modifications of BLAST were developed – gapped BLAST and PSI-BLAST. The former is three times faster than the original program and automatically combines statistically significant alignments produced by BLAST into a position-specific score matrix. The resulting Position Specific Iterated BLAST (PSI-BLAST) is very sensitive to weak but biologically relevant sequence similarities (4,5).

Multiple Sequence Alignment

Multiple alignment has many important applications in sequence analysis. Although it is often used to simply display or summarize the relationships among a group of sequences, multiple alignment also has major roles in protein modeling or structure prediction, studies of molecular evolution, and the detection and quantitation of sequence patterns or motifs (1). The alignment of the entire family of proteins or nucleic acids may provide more information than the alignment of any pair of those sequences. There are several approaches to multiple alignment:

Global alignment methods are most appropriate to sequences that are small (<500 residues), approximately equal in length and share a global, but perhaps distant relationship.

Local alignment methods are preferable for longer sequences that vary greatly in length and may share only isolated regions of similarity separated by variable-length segments of little or no sequence conservation (6).

It is usually not possible to simultaneously align more than ten sequences by the rigorous approach used in the MSA program. For larger numbers of sequences a variety of approximate methods have been proposed. The most useful approaches are based on clustering and have been termed progressive alignment methods (8).

Hidden Markov Models

Pairwise sequence comparison methods such as BLAST and FASTA generally assume that all positions are equally important. Multiple alignments of protein sequence families indicate residues that are more conserved than the others, and the points at which insertions or deletions are more frequent. Three-dimensional structural information allows structural environments to be taken into account when scoring aligned residues. A "profile" (defined as a consensus primary structure model consisting of position-specific residue scores and insertions / deletions penalties) is an intuitive step beyond the pairwise sequence alignment (7).

The problem with profiles is that they are complicated models with many free parameters. To address this problem Hidden Markov Models (HMMs) have been introduced. The HMM is composed of some number of states, which might correspond to positions in a three-dimensional structure or columns of a multiple alignment. Each state "emits" symbols (residues) according to symbol emission probabilities, and the states are interconnected by state transition probabilities. Starting from some initial state, a sequence of states is generated by moving from state to state transition probabilities until an end state is reached. Each state then emits symbols according to that state’s emission probability distribution, creating an observable sequence of symbols (7). HMM-based profiles make two important assumptions:

Pairwise (or higher-order) correlations between sequences are ignored.

Sequences are generated independently from a model.

In the whole HMMs present a good basis for integrating structural and sequence information.

From the illustrated above is it evident that sequence comparison continues to be an active area of mathematical, statistical and computational research and development, and the current generation of database search and sequence alignment tools offer considerable improvements in both efficiency and reliability (1).

The development of diverse alignment methods permits to build and predicts more reliable and relevant biological models, which have not been elucidated experimentally.

A relatively undeveloped capability is the integration of molecular sequence data with other types of data in the scientific literature. Further developments in this field are under way.

References.

M. Gribskov chapter 3. Similarity and Homology.

Needleman S.B., Wunch C.D. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J.Mol.Biol. (1970) 48, 443-453.

M. Waterman. Identification of Common Molecular Subsequences. J Mol. Biol. (1981) 147, 195-197.

Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acid Research (1997) Vol.25, no.17, 3389-3402.

Pearson W.R. Effective Protein Sequence Comparison. Methods in Enzymology Vol. 266 chapter 15, 227-259.

Altschul S.F., Boguski M.S., Gish W., Wootton J.C. Issues in Searching Molecular Sequence Databases. Nature Genetics (1994) 6, 119-129.

Eddy S.R. (Washington University School of Medicine). Hidden Markov Models. (1996).

Higgins D.G., Thompson J.D., Gibson T.J. Using CLUSTAL for Multiple Sequence Alignment. Methods in Enzymology (1996) vol.266, 383-402.

Waterman M.S. (editor). Mathematical Methods for DNA Sequences.