Barry Fine

MB&B 452a: Bioinformatics

Professor Gerstein

10 October 1999

BLAST and Subtle Homologies

Over the past several years, refinements to the Basic Local Alignment Search Tool (BLAST) have sought to uncover homologies and functional relationships between proteins that were previously recognizable only through comparisons of tertiary and quaternary structure. The original BLAST program uses hashing and hit extension to generate a maximal segment pair, scored primarily on the amino acid similarity between the query sequence and a database sequence. However, protein families are not defined solely by overall sequence similarity. As was demonstrated with two related ATPases, CED4, a regulator of apoptosis in C. elegans, and Apaf-1, its human homolog, standard BLAST is not sensitive enough to detect subtle but informative similarities between proteins. The BLAST algorithm has been adjusted to compensate for this deficiency. Two of the more commonly used variations on BLAST, PSI-BLAST and PHI-BLAST, are discussed here.
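The hashing and hit-extension steps mentioned above can be illustrated with a minimal sketch. It is not the BLAST implementation: it assumes exact word matches instead of neighborhood words, a toy identity score (+2 for a match, -1 for a mismatch) instead of a substitution matrix, and an arbitrary drop-off limit. The names index_words, extend_hit, and best_segment_pair are hypothetical.

```python
from collections import defaultdict

def index_words(seq, w=3):
    """Hash every length-w word of seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table

def score(a, b):
    # Toy scoring (assumption): +2 for identity, -1 otherwise.
    return 2 if a == b else -1

def extend_hit(query, subject, qi, si, w=3, x_drop=4):
    """Extend a word hit without gaps in both directions.

    Extension stops once the running score falls more than x_drop below
    the best score seen. Returns (score, q_start, q_end), half-open.
    """
    best = cur = sum(score(query[qi + k], subject[si + k]) for k in range(w))
    q_start, q_end = qi, qi + w
    # Extend to the right.
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject) and best - cur <= x_drop:
        cur += score(query[i], subject[j])
        i += 1
        j += 1
        if cur > best:
            best, q_end = cur, i
    # Extend to the left, starting from the best right-extended segment.
    cur = best
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0 and best - cur <= x_drop:
        cur += score(query[i], subject[j])
        if cur > best:
            best, q_start = cur, i
        i -= 1
        j -= 1
    return best, q_start, q_end

def best_segment_pair(query, subject, w=3):
    """Return the highest-scoring ungapped segment found from any word hit."""
    table = index_words(query, w)
    best = (0, 0, 0)
    for sj in range(len(subject) - w + 1):
        for qi in table.get(subject[sj:sj + w], []):
            best = max(best, extend_hit(query, subject, qi, sj, w))
    return best
```

Calling best_segment_pair("MKTAYIAKQR", "GGGTAYIAKGGG") seeds on the shared words and extends them to the common segment TAYIAK. Because every score here depends only on the two residues being compared, a conserved position counts no more than a variable one, which is the limitation the position-specific methods below address.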

In 1997, Altschul and colleagues proposed a new version of BLAST that modifies the original heuristic by introducing a position-specific scoring matrix. According to Altschul et al., position-specific matrices are important in two respects: they replace the standard substitution matrix (such as PAM250 or BLOSUM62) with a matrix based on the probability of each amino acid occurring at each position of a pattern, and they provide a “precise definition of the boundaries of important motifs.” Recognizing that some positions are more conserved than others, and that this position-specific conservation is one of the key signals that single-pass comparisons miss, PSI-BLAST constructs its own matrix anew through an iterated profile search.
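To make the idea of a position-specific matrix concrete, the following sketch builds one from a small ungapped alignment. It is a simplification under stated assumptions: a uniform background frequency, a flat pseudocount of 1, and no sequence weighting, none of which is claimed to match what PSI-BLAST actually does. The names build_pssm and score_segment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per alignment column: residue -> log2 odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for column in zip(*aligned):
        # Count only standard residues, so gap characters are skipped.
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_segment(pssm, segment):
    """Score an ungapped segment against the matrix, column by column."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

alignment = ["GKS", "GKT", "GKS", "AKT"]
pssm = build_pssm(alignment)
print(round(pssm[1]["K"], 2), round(pssm[1]["W"], 2))
```

In this toy alignment the invariant lysine in the second column scores higher than any residue in the more variable first column, while an unobserved residue such as tryptophan scores below zero. A single substitution matrix cannot express that difference between positions.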

In constructing a profile, PSI-BLAST can be run either automatically or interactively, one iteration at a time. In the latter mode, the interface lets the user guide the search by setting the E-value cutoff, the gap costs, and the substitution costs. Also, after the initial construction of a profile, PSI-BLAST allows the user to choose which of the reported proteins to include in successive iterations, which continue until the search converges, that is, until no new sequences are found. Thus the construction of a profile can be shaped somewhat by the user.
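The iterate-until-convergence loop can be sketched as follows. This is a toy under several assumptions: every sequence has the same length as the query and is compared without gaps, a raw score threshold stands in for the E-value cutoff, and the profile uses a uniform background with a flat pseudocount. The function names and the example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_from(members, pseudocount=1.0):
    """Build a log2 odds profile from equal-length, ungapped sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    profile = []
    for column in zip(*members):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def iterate_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from each round's hits until no new hits appear.

    Returns (hits, rounds_used). The score threshold is a stand-in for
    an E-value cutoff.
    """
    members = [query]
    hits = []
    for round_number in range(1, max_rounds + 1):
        profile = profile_from(members)
        hits = [
            seq for seq in database
            if sum(col[aa] for col, aa in zip(profile, seq)) >= threshold
        ]
        new_members = [query] + hits
        if set(new_members) == set(members):
            return hits, round_number  # converged: nothing new was found
        members = new_members
    return hits, max_rounds

database = ["GKSTLI", "AKTTLI", "WWWWWW"]
hits, rounds = iterate_search("GKSTLV", database, threshold=3.5)
print(hits, rounds)
```

In this example the second database sequence falls below the threshold when scored against the query alone, but passes once the first hit has been folded into the profile. That is the sense in which iteration reaches relatives that a single pass misses.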

The scoring matrix is derived from the alignment built up over successive iterations rather than from a preconstructed matrix, as would be used in the original BLAST. PSI-BLAST uses the motifs captured in this way to detect homologies conserved across protein families. As such, sequences are not weighted equally: a set of protein sequences that may belong to the same family carries more weight than a single sequence. By constructing a scoring matrix predicated on the conservation of motifs rather than on overall sequence similarity, PSI-BLAST is able to discover subtle similarities between seemingly unrelated proteins. While an original BLAST search might typically report some 500 hits (the top scores above the threshold), all of these hits could be based on a single domain of the protein. It is the weaker yet biologically significant relationships that PSI-BLAST searches for (Park et al. 1998; Aravind and Koonin 1999).

The search for such subtle relationships is of great utility when probing superfamilies of proteins. As described by Altschul et al. (1997), PSI-BLAST was used to verify the superfamily of proteins containing the BRCT domain, a motif found in proteins related to cell cycle control. An original BLAST search with the BRCT domain from the C-terminus of BRCA1 found mostly BRCA1 sequences. Iterations with PSI-BLAST, on the other hand, produced a wealth of alignments, including newly entered sequences such as H. sapiens KIAA0259, with 8 such domains, as well as homologs from yeast, bacteria, plants, and worm.

The power of these searches lies in revealing hidden biological relationships, not only for proteins of unknown function but for characterized proteins as well. For instance, the PSI-BLAST iterations discovered the BRCT domain within a human protein called Pescadillo, whose zebrafish ortholog has been implicated in embryonic development. Thus, without a single experiment, the BRCT domain has been linked, however remotely, to zebrafish development. Tenuous as such leads are, it is from them that scientific research stands to gain the most from PSI-BLAST.

In the continuing search for protein families, BLAST has been altered yet again. Pattern-Hit Initiated (PHI) BLAST takes a different approach from its PSI sibling. Instead of formulating its own matrix to generate HSPs (high-scoring segment pairs), PHI-BLAST limits the number of database proteins considered by using patterns as seeds. Computationally more efficient than PSI-BLAST, PHI-BLAST narrows the search by first scanning the database for proteins that contain the pattern (usually a short residue sequence) and then restricting the alignment to those proteins.

PHI-BLAST operates under two related premises. The first is that a few amino acid residues of the query protein are of particular biological importance. The second is that these important residues will be conserved between related proteins. Limiting the alignment search to those database proteins that contain the pattern has a threefold effect: it reduces background noise in the search, reveals otherwise hidden relationships between proteins, and implicates the pattern residues in roles beyond their laboratory-discovered function.

As an example of the power of PHI-BLAST, Zhang et al. (1998) employed it to find proteins with a structural relationship to the HS90-like ATPase domains of such proteins as MutL (DNA repair), type II topoisomerases, histidine kinases, and of course HS90 proteins. Four patterns were used separately to probe the NCBI non-redundant protein sequence database: [G/A]xxxxGK[S/T], hxhxDxGxG (h=[ILVMF] hydrophobic residues), DhDhhh, and QxxGRx[G/A]R. Non-trivial hits included a C. elegans protein, ZC155.3, a possible ortholog of “Bovine synaptocanalin I.” According to Zhang et al., “the synaptocanalin domain apparently was fused to the worm protein by exon misassembly.” A previously undescribed human protein, KIAA0136, and a plant homolog were also among the non-trivial hits.
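The pattern-seeding step can be illustrated by translating the notation used above into regular expressions and keeping only the database sequences that contain a match. This is a sketch of the filtering idea only, not of how PHI-BLAST scores or extends alignments around the pattern. The function names and the two example sequences are hypothetical, and the notation handled is just what appears in this essay (x for any residue, h for [ILVMF], bracketed alternatives separated by a slash).

```python
import re

HYDROPHOBIC = "[ILVMF]"  # the definition of h given in the text

def pattern_to_regex(pattern):
    """Translate the essay's pattern notation into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Bracketed alternatives such as [G/A] become the class [GA].
            close = pattern.index("]", i)
            parts.append("[" + pattern[i + 1:close].replace("/", "") + "]")
            i = close + 1
            continue
        if ch == "x":
            parts.append(".")          # any residue
        elif ch == "h":
            parts.append(HYDROPHOBIC)  # hydrophobic residue
        else:
            parts.append(ch)           # literal residue
        i += 1
    return re.compile("".join(parts))

def seed_filter(pattern, database):
    """Keep only sequences containing the pattern; map name -> match starts."""
    regex = pattern_to_regex(pattern)
    hits = {}
    for name, seq in database.items():
        starts = [m.start() for m in regex.finditer(seq)]
        if starts:
            hits[name] = starts
    return hits

# Made-up sequences, for illustration only.
database = {"p1": "MSAGPTGSGKSTLL", "p2": "MKVLDDEE"}
print(seed_filter("[G/A]xxxxGK[S/T]", database))
```

Only the sequences that survive this filter would go on to be aligned, anchored at the reported match positions, which is where the reduction in background noise comes from.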

As more genomes are completed, both PSI-BLAST and PHI-BLAST will become increasingly important as groundwork for molecular biologists and biophysicists. Because current bench techniques in molecular biology will be no match for the imminent profusion of genomic information, computers will have to replace, to a degree, the lab bench. Under the logic of orthologs and protein superfamilies, demonstrating conserved domains and patterns will provide a rich yet rapid understanding of protein function, and will give added direction to the molecular characterization of proteins.

List of Works Cited:

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ, Nucleic Acids Res 1997 Sep 1;25(17):3389-402

Aravind L, Koonin EV, J Mol Biol 1999 Apr 16;287(5):1023-40

Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C, J Mol Biol 1998 Dec 11;284(4):1201-10

Zhang Z, Schaffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF, Nucleic Acids Res 1998 Sep 1;26(17):3986-90