Joshua S. Weinger

12/15/2000

MB&B

Identification of Protein Homology By Hidden Markov Models

 

In the past, experimental data about a protein’s function usually preceded the acquisition of any sequence information for the protein or the gene encoding it.  Knowledge of a protein’s function almost always preceded any knowledge of its structure.  In recent years, however, the generation of new sequence data, fueled particularly by the genome projects, has far outpaced the rate at which experimental information about newly discovered genes can be gathered.  Information about a protein’s structure and function can be gained by establishing relationships between that protein and proteins of known structure or function.  Methods for identifying such relationships have become important tools for genome annotation and for the characterization of new proteins.  Among the most powerful of these methods are algorithms employing hidden Markov models (HMMs) to describe the consensus sequence of a protein family.  Hidden Markov models have a number of advantages over other types of sequence similarity searches.

A Markov process is a stochastic process in which the past and future conditions of the system are independent of each other given the present condition: each successive state of the system depends only on the current state, in a probabilistic manner.  In a Markov chain, a type of linear Markov process, there may be an infinite set of possible progressions from a given starting state, with each possibility having a finite probability.[1]  When Markov models are used in bioinformatics to model a family of biological sequences, the “states” of the Markov chain each represent one amino acid or nucleotide position in the sequence.  The model dictates the probability of finding each amino acid at every point of the consensus sequence, as well as the probability of finding an insertion or deletion at each point.[2]
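As a minimal illustration of this property, a first-order Markov chain can be simulated by choosing each next state from a probability table indexed only by the current state. The two states and their transition probabilities below are invented for illustration:

```python
import random

# A toy first-order Markov chain over two states. The next state depends
# only on the current state, via these illustrative (made-up) probabilities.
transitions = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.4, "B": 0.6},
}

def step(state, rng):
    """Choose the next state probabilistically, given only the current one."""
    r = rng.random()
    cumulative = 0.0
    for nxt, p in transitions[state].items():
        cumulative += p
        if r < cumulative:
            return nxt
    return nxt  # guard against floating-point rounding

def walk(start, n, seed=0):
    """Generate a chain of n transitions from the starting state."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(n):
        states.append(step(states[-1], rng))
    return states

chain = walk("A", 10)
```

Note that `walk` never consults anything but the most recent state, which is exactly the Markov property described above.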

In the Markov models typically used for sequence alignment there are three types of states.  M states, or match states, represent the consensus sequence of the family of proteins used to generate, or “train,” the model.  Each match state has a matrix of probabilities for emitting each of the 20 amino acids, and these probabilities always sum to 1.  D states, or deletion states, represent a deletion of a position along the string of M states.  I states, or insertion states, represent an insertion of amino acids between positions of the consensus.  Each I state also has a matrix of amino acid probabilities, which can either be determined by the training set of sequences or simply taken from a standard matrix such as a PAM matrix.  In addition to the matrix of amino acid probabilities at each M and I state, the model specifies a set of probabilities for moving from each state to each possible next state.  For example, from a given match state Mn there are three possible moves: to the next match state, Mn+1; to the next deletion state, Dn+1; or to an insertion state, In.  The transition is made probabilistically, and the sum of the three probabilities is 1.  From an insertion state it is possible to move to the next match state, to the next deletion state, or back to the same insertion state, which allows for multiple insertions at a single position.[2]  Starting at the beginning of the chain and moving forward through it generates a sequence.  The model is called a “hidden” Markov model because only the amino acid sequence is observed; the sequence of states that generated it is hidden, but may be inferred by determining the path with the highest probability of generating that particular sequence.[3]   See Figure 1 for a diagram of this model.
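The architecture just described can be sketched directly: the probability of any one path through the model is the product of the transition probabilities along the path and the emission probabilities at each M and I state visited (D states emit nothing). The two-position model below, with a toy three-letter alphabet and invented probabilities, is a sketch of this idea, not a trained model:

```python
# Minimal sketch of a two-position profile HMM. Each match state has an
# emission distribution over a toy 3-letter alphabet; each state's outgoing
# transition probabilities sum to 1. All numbers are illustrative.
emissions = {
    "M1": {"A": 0.8, "C": 0.1, "D": 0.1},
    "M2": {"A": 0.1, "C": 0.7, "D": 0.2},
    "I1": {"A": 1/3, "C": 1/3, "D": 1/3},  # flat insert distribution
}

transitions = {
    "Begin": {"M1": 0.9, "D1": 0.1},
    "M1": {"M2": 0.8, "D2": 0.1, "I1": 0.1},  # the three moves from a match state
    "I1": {"M2": 0.5, "D2": 0.1, "I1": 0.4},  # self-loop allows multiple insertions
    "D1": {"M2": 0.8, "D2": 0.1, "I1": 0.1},
    "M2": {"End": 1.0},
    "D2": {"End": 1.0},
}

def path_probability(path, sequence):
    """Probability that a given state path generates `sequence`.
    D states and End emit nothing; M and I states each emit one residue."""
    p = 1.0
    i = 0
    prev = "Begin"
    for state in path:
        p *= transitions[prev][state]
        if not state.startswith("D") and state != "End":
            p *= emissions[state][sequence[i]]
            i += 1
        prev = state
    return p

p = path_probability(["M1", "M2", "End"], "AC")
# = 0.9 (Begin->M1) * 0.8 (emit A) * 0.8 (M1->M2) * 0.7 (emit C) * 1.0 (M2->End)
```

Summing `path_probability` over all paths consistent with a sequence gives the total probability that the model generates it; taking the maximum instead recovers the most probable hidden path.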

 

In order to be useful for finding homologous sequences, the Markov model described above must be trained to specifically recognize a protein family.  In other words, the parameters of the model must be set so as to maximize the probability of generating members of a given protein family.  A method for training HMMs was designed by P. Baldi and colleagues and demonstrated using three protein families with many known members: globins, immunoglobulins, and kinases.  This allowed them to train their HMMs on a subset of each family and then demonstrate that the resulting HMM could accurately distinguish members of the family from non-members.  Their training method involved starting with an approximate initial model.  Then, for each sequence in the training set, they determined the most probable path through the model that would generate that sequence.  As the most probable paths were found, the parameters of the model were automatically adjusted to better fit the training sequences.  The training set was cycled through repeatedly until the parameters no longer changed significantly.[4]
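A rough sketch of this train-by-most-probable-path idea (often called Viterbi training) is shown below, on a deliberately simplified two-state HMM rather than a full profile architecture. All starting probabilities and training strings are invented, and pseudocounts keep the re-estimated parameters nonzero:

```python
import math

STATES = ["S0", "S1"]
ALPHABET = "AB"

def viterbi(seq, trans, emit, init):
    """Most probable state path for seq (log-space dynamic programming)."""
    V = [{s: math.log(init[s]) + math.log(emit[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][best] + math.log(trans[best][s]) + math.log(emit[s][x])
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

def normalize(counts):
    return {s: {k: v / sum(row.values()) for k, v in row.items()}
            for s, row in counts.items()}

def train(seqs, trans, emit, init, iterations=5, pseudo=1.0):
    """Repeat: find each sequence's best path, re-estimate from path counts."""
    for _ in range(iterations):
        t_counts = {s: {t: pseudo for t in STATES} for s in STATES}
        e_counts = {s: {a: pseudo for a in ALPHABET} for s in STATES}
        for seq in seqs:
            path = viterbi(seq, trans, emit, init)
            for state, residue in zip(path, seq):
                e_counts[state][residue] += 1
            for a, b in zip(path, path[1:]):
                t_counts[a][b] += 1
        trans, emit = normalize(t_counts), normalize(e_counts)
    return trans, emit

seqs = ["AAAB", "AABB", "AAAA", "ABBB"]
trans0 = {"S0": {"S0": 0.6, "S1": 0.4}, "S1": {"S0": 0.4, "S1": 0.6}}
emit0 = {"S0": {"A": 0.7, "B": 0.3}, "S1": {"A": 0.3, "B": 0.7}}
init0 = {"S0": 0.5, "S1": 0.5}
trans, emit = train(seqs, trans0, emit0, init0)
```

A fixed number of iterations is used here for simplicity; the method described above instead cycles until the parameters stop changing significantly.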

Once an HMM has been trained to recognize members of a protein family, finding new members of that family in a database is simply a matter of going through the database and calculating the probability of generating each sequence from the HMM.  Sequences whose probability of generation is above a certain threshold are identified as members of the family.  This method has a number of advantages over other methods for finding homologous sequences in a database.  Pairwise searching methods such as BLAST, which take a query sequence and compare it individually to every sequence in a database, have been a very popular and effective way to find proteins with a high degree of homology to the query.  However, the effectiveness of such methods drops off when looking for more distantly related proteins.  Pairwise comparison methods cannot use any information about the relative importance of each amino acid to the structure and function of the query protein, and thus about the likely level of conservation of each amino acid.  These algorithms must therefore assign equal importance to each residue of the query.[3]  Algorithms that use multiple sequence alignments as their query, however, are able to assign greater importance to certain amino acids and to identify which substitutions or insertions are acceptable.  Such methods allow for a looser definition of the query sequence, and are thus able to identify related proteins that may be too diverged from any single sequence in the query alignment to be identified by pairwise comparison.[3]
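The scanning step can be sketched as follows: compute, for each database sequence, the total probability that the model generates it (the forward algorithm, which sums over all state paths), compare it to a background null model, and keep sequences above a cutoff. The two-state model, null frequencies, database, and threshold are all illustrative:

```python
import math

# Illustrative two-state HMM over a toy two-letter alphabet.
STATES = ["S0", "S1"]
trans = {"S0": {"S0": 0.6, "S1": 0.4}, "S1": {"S0": 0.4, "S1": 0.6}}
emit = {"S0": {"A": 0.9, "B": 0.1}, "S1": {"A": 0.2, "B": 0.8}}
init = {"S0": 0.5, "S1": 0.5}
null = {"A": 0.5, "B": 0.5}  # background model: i.i.d. residues

def forward_logprob(seq):
    """log P(seq | HMM) by the forward algorithm (sums over all paths)."""
    f = {s: init[s] * emit[s][seq[0]] for s in STATES}
    for x in seq[1:]:
        f = {s: sum(f[p] * trans[p][s] for p in STATES) * emit[s][x]
             for s in STATES}
    return math.log(sum(f.values()))

def log_odds(seq):
    """Score relative to the null model; > 0 favors the family HMM."""
    return forward_logprob(seq) - sum(math.log(null[x]) for x in seq)

database = ["AABA", "BBBB", "ABAB", "AAAA"]
threshold = 0.0  # keep sequences more probable under the HMM than under null
hits = [s for s in database if log_odds(s) > threshold]
```

Real profile-HMM searches likewise report log-odds scores against a null model rather than raw probabilities, since raw probabilities shrink with sequence length.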

In addition to heightened sensitivity over pairwise searching methods, HMM methods have a number of advantages over other multiple alignment-based methods for detecting protein homologies.  Standard profile searches query a database using a profile of a multiple alignment, which is essentially a consensus sequence that includes information about the frequency of each amino acid at each position.  One disadvantage of this type of method is that it requires a pre-existing multiple alignment, whereas HMM methods do not.  In fact, a multiple alignment of the training sequences is automatically generated as a byproduct of the HMM training process: the sequences can be aligned based on their most probable paths through the HMM.[3]  Perhaps more important than an HMM’s ability to generate its own multiple alignment is the fact that all “penalties” assessed to a potentially homologous sequence are based explicitly on the relationships between the training sequences, and are never arbitrary.  Standard profile methods generally assess a fixed gap penalty, and then a fixed gap extension penalty, regardless of the position within the sequence at which the gap occurs.  This type of model fails to consider the fact that insertions are often more likely to occur at certain positions of the consensus sequence.  In an HMM algorithm, penalties are represented by reductions in the probability that a sequence could be generated by the HMM.  Positions where insertion is more likely have a higher tMI (M-to-I transition probability), so an insertion at such a position has a less harmful effect on the sequence’s overall probability score than an insertion at a position with a low tMI.  Similarly, variable gap extension scores are analogous to variable tII values.[5]  Because of their greater adaptability to the training data, HMMs can accommodate protein families with lower degrees of conservation than standard profile methods can.[4]
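The contrast can be made concrete by converting transition probabilities into penalties: in negative-log terms, the cost of opening (or extending) an insertion at a given column is simply -log tMI (or -log tII) for that column, so a flat gap penalty amounts to forcing every column to share one value. The numbers below are invented for illustration:

```python
import math

# Position-specific insertion probabilities (illustrative, not from a
# trained model); position 2 might correspond to a tolerant surface loop.
t_MI = {1: 0.01, 2: 0.30, 3: 0.01}   # gap-open analogue
t_II = {1: 0.05, 2: 0.50, 3: 0.05}   # gap-extension analogue

def open_penalty(pos):
    """Cost, in negative log probability, of opening an insertion at pos."""
    return -math.log(t_MI[pos])

def extend_penalty(pos):
    """Cost of extending an existing insertion by one residue at pos."""
    return -math.log(t_II[pos])

# A standard profile would instead charge one flat penalty everywhere,
# regardless of position.
opens = {pos: open_penalty(pos) for pos in t_MI}
```

With these numbers, opening an insertion at position 2 costs far less than at positions 1 or 3, exactly the position-dependence a flat gap penalty cannot express.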

HMM methods of homologue identification are not entirely without problems.  The complexity of HMMs renders them more prone to overfitting the training data than some other methods.  HMMs allow insertions and deletions at any position based on the behavior of the training data, and are thus more sensitive to the specific behavior of the protein family being modeled.  However, this specificity comes at the cost of an increased number of parameters, and more parameters increase the likelihood that the model caters to the specific set of data it is given, losing the generality necessary to accommodate the random changes that would certainly be present in target homologous sequences.  This problem can be dealt with in part by constraining the model wherever possible, thereby reducing the number of parameters.  For instance, strings of ungapped blocks (sections of standard profiles) can be inserted into the HMM in regions where insertions are very rare.[5]
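One way to picture this constraint: if the training alignment shows essentially no insertions after certain consensus columns, the insert states (and their emission parameters) at those columns can be dropped, leaving ungapped, profile-like blocks. The frequencies and cutoff below are invented for illustration:

```python
# Fraction of training sequences with an insertion after each consensus
# column (illustrative numbers, not from real data).
insert_freq = [0.00, 0.45, 0.02, 0.00, 0.33]

MIN_INSERT_FREQ = 0.05  # hypothetical cutoff for keeping an insert state

# Keep insert states only where the training data actually uses them; the
# remaining columns behave like an ungapped block of a standard profile.
has_insert_state = [f >= MIN_INSERT_FREQ for f in insert_freq]

# Each insert state carries its own 20-amino-acid emission distribution,
# so dropping unused ones removes parameters that could otherwise overfit.
params_full = len(insert_freq) * 20
params_constrained = sum(has_insert_state) * 20
```

Fewer free parameters means fewer opportunities for the model to memorize quirks of the training set rather than genuine features of the family.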

Another concern with HMM searching methods is that a large number of training sequences with known homology are needed to train the model, and such data are not available for many protein families of interest.  The SAM-T98 method for detecting protein homologies, described by K. Karplus and colleagues, deals with this limitation, allowing an HMM-based search with only a few query sequences, or even a single one.  In this method an initial HMM is designed from the few available homologues.  If only a single sequence is to be used, a WU-BLAST search is performed with a loose score cut-off to identify initial potential homologues, which can then be used as the seed family for designing the initial HMM.  The initial HMM is then used to search the non-redundant database for homologues.  Once homologues with reasonable local alignment to the HMM are found in this first search, the HMM is redesigned, incorporating the newly found homologues into the training.  This cycle can be repeated a number of times, but three cycles have generally been found to be sufficient.  Finally, the resulting HMM can be used to search any database for distant homologues of the initial query sequence.[6],[7]
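The iterative structure of this approach can be sketched as a simple loop. Here `build_hmm` and `search` are hypothetical stand-ins (not the real SAM implementation, and no real scoring), included only to show how each round folds newly found homologues back into the training set:

```python
def build_hmm(sequences):
    """Stand-in: 'train' an HMM on the current seed family."""
    return {"trained_on": tuple(sequences)}

def search(hmm, database, cutoff):
    """Stand-in: return database sequences scoring above cutoff.
    Here 'score' is just shared-letter overlap with the first seed."""
    seed = hmm["trained_on"][0]
    return [s for s in database if len(set(s) & set(seed)) >= cutoff]

def iterative_search(seeds, database, rounds=3, cutoff=2):
    """SAM-T98-style loop: build model, search, retrain on the hits."""
    family = list(seeds)
    for _ in range(rounds):            # ~3 rounds generally suffice
        hmm = build_hmm(family)
        hits = search(hmm, database, cutoff)
        family = list(seeds) + hits    # fold new homologues into training
    return build_hmm(family)

db = ["ACDE", "WXYZ", "ACDF", "QQQQ"]
final = iterative_search(["ACDE"], db)
```

The final model can then be run against any database; because it was trained on an expanding family rather than a single sequence, it can reach more distant homologues than the original query could.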

Hidden Markov models have proved to be a very effective tool for identifying homology between proteins, even when the level of homology is so low that it cannot be detected by other searching methods.  They represent a strictly regimented method for modeling a family of proteins that recognizes many subtleties of the familial relationship missed by other algorithms.  Considerations such as gap penalties, and, perhaps less obviously, the three-dimensional environment of the amino acids, are built into the model automatically based on the probabilities of finding certain traits (e.g., particular residues or long insertions) at specific locations within the sequences.  Because HMMs perform so well at recognizing subtle relationships between data sets, they have come into use in a number of biological applications beyond homology searches.  For instance, they have been used in the construction of phylogenetic trees (identification of phylogenetic relationships)[8] and for the identification and differentiation of signal and anchor peptides.[9]



[1] Iyanaga, S. and Kawada, Y. (eds.) (1977) “Markov Processes” in The Encyclopedic Dictionary of Mathematics. MIT Press, Cambridge, Mass. pp. 830-838.

[2] Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994) Hidden Markov models in computational biology. Journal of Molecular Biology. 235: 1501-1531.

[3] Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology. 6: 361-365.

[4] Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. (1994) Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences. 91: 1059-1063.

[5] Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics. 14: 755-762.

[6] Karplus, K., Barrett, C., and Hughey, R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics. 14: 846-856.

[7] Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., and Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology. 284: 1201-1210.

[8] Mitchison, G.J. (1999) A probabilistic treatment of phylogeny and sequence alignment. Journal of Molecular Evolution. 49: 11-22.

[9] Nielsen, H., Brunak, S., and von Heijne, G. (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering. 12: 3-9.