10 October 1999
Over the past several years,
refinements to the Basic Local Alignment Search Tool (BLAST) have sought to
uncover homologies and functional relationships between proteins that were previously
recognizable only through tertiary and quaternary structure comparisons. The original BLAST program uses hashing and
hit extension to generate a maximal segment pair, scored primarily on
amino acid similarity between the query sequence and the database
sequence. However, protein families are
not defined solely by overall sequence similarity.
As was demonstrated with the related ATPases CED4, a regulator of
apoptosis in C. elegans, and Apaf-1, its human counterpart, which share a
domain with plant disease resistance proteins, standard BLAST is not
sufficiently sensitive to detect subtle but biologically informative
similarities between proteins. Adjustments have been made in the algorithm of
the BLAST program to compensate for this deficiency. Two of the more commonly used variations on BLAST, PSI-BLAST and
PHI-BLAST, will be discussed here.
In 1997 Stephen Altschul and colleagues
proposed a new algorithm for BLAST, modifying the original heuristic by
introducing a position-specific matrix for scoring purposes. According to Altschul, position-specific
matrices are important in two respects:
the replacement of the standard substitution matrix (like PAM250 or
BLOSUM62) with a matrix based on the probability of amino acids occurring in
certain pattern positions, and the “precise definition of the boundaries of
important motifs.” Recognizing that
certain positions for amino acids are more conserved than others and that this
positional conservation is one of the key signals that single-pass comparisons do not
pick up, PSI-BLAST constructs its own matrix anew through an iterated profile search.
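To make the idea of a position-specific matrix concrete, the following Python sketch builds one from a small ungapped alignment. It is an illustration only: the toy alignment, the uniform background frequencies, the add-one pseudocount, and the function names build_pssm and score_segment are assumptions made here, not the weighting or pseudocount scheme that PSI-BLAST itself uses.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (not taken from PSI-BLAST): the alignment is ungapped,
    every sequence counts equally, and the background is uniform.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive when the residue is more common in this column
            # than expected by chance, negative when it is rarer.
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the column-specific scores of a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical aligned motif: the first two columns are fully conserved.
alignment = ["GKSTL", "GKTAL", "GKSVI", "GKTSL"]
pssm = build_pssm(alignment)

print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))
print(score_segment(pssm, "GKSTL") > score_segment(pssm, "AAAAA"))
```

Unlike a fixed substitution matrix, the score for a given residue here depends on the column: G is rewarded strongly in the first position, where it is fully conserved, but not in the variable fourth position.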
In constructing a profile,
PSI-BLAST can be run either automatically or one iteration at a time. In the latter mode, the interface enables the
user to guide the search with respect to the cutoff E value, gap
costs and substitution costs. Also,
after the initial construction of a profile, PSI-BLAST allows the user to select
which protein families among the hits are included in
successive iterations, which can be repeated until the search converges (no new sequences are found). Thus the
construction of a profile can be shaped somewhat by the user.
The scoring matrix is
derived from a successive alignment of motifs rather than taken from an already
constructed matrix, as in the original BLAST. PSI-BLAST seeks to use such motifs to capture
conserved homologies in protein families.
As such, sequences are not weighted equally. A set of protein sequences that might occur in the same family
carries more weight than a single sequence.
By constructing a scoring matrix predicated on the conservation of
motifs instead of straight homology, PSI-BLAST is able to discover subtle
similarities between seemingly unrelated proteins. While an original BLAST search would typically report
around 500 proteins (top scores above the threshold), all these hits
could potentially be based on one domain of the protein. It is the weaker, yet biologically
significant, relationships that PSI-BLAST searches for (Park et al.; Aravind
et al.).
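The way iteration can surface a weaker relationship may be easier to see in a toy example. The Python sketch below repeats a profile search until the set of included segments stops changing. Everything in it is an assumption made for illustration: the sequences are invented, matching is ungapped, a raw score threshold stands in for the E-value cutoff, and the names profile_from, best_window and iterate_search are hypothetical rather than part of any BLAST program.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_from(members, pseudocount=1.0):
    """Log-odds profile (one dict per column) from equal-length segments."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*members):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Best ungapped window of the sequence: (score, segment)."""
    width = len(profile)
    best = (float("-inf"), "")
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        score = sum(col[aa] for col, aa in zip(profile, segment))
        best = max(best, (score, segment))
    return best


def iterate_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from each round's hits until nothing changes."""
    members = {query}
    for round_number in range(1, max_rounds + 1):
        profile = profile_from(sorted(members))
        hits = {query}
        for sequence in database:
            score, segment = best_window(profile, sequence)
            if score >= threshold:
                hits.add(segment)
        if hits == members:
            return members, round_number  # converged: no new segments
        members = hits
    return members, max_rounds


# Invented sequences; the second resembles the query only weakly.
database = ["MAGKSTIR", "GKTVIPP", "WWWWWWW"]
members, rounds = iterate_search("GKSTL", database, threshold=2.0)
print(sorted(members), rounds)
```

In this toy run the segment GKTVI scores below the threshold against the query alone, but is picked up in the second round once GKSTI has been folded into the profile; the third round finds nothing new, so the search stops.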
The search for such
subtle relationships is especially useful when probing
superfamilies of proteins. As described
by Altschul et al. (1997), PSI-BLAST was used to verify the superfamily of
proteins containing the BRCT domain, a motif found in proteins related to
cell cycle control. A search with the
original BLAST, using the BRCT domain in the C-terminus of BRCA1, found mostly
BRCA1 sequences. Iterations with
PSI-BLAST, on the other hand, produced a wealth of alignments, including newly
entered sequences such as H. sapiens
KIAA0259, with 8 such domains, as well as homologs from yeast, bacteria,
plants, and worms.
The power of these searches
lies in revealing hidden biological relationships not only between proteins of
unknown function, but between characterized proteins as well. For instance, the PSI-BLAST iterations discovered the BRCT domain
within a human protein called Pescadillo, whose zebrafish ortholog has been
implicated in embryonic development.
Thus, without a single experiment, the BRCT domain has been linked,
however remotely, to zebrafish development. Tenuous as such links are, it is from these investigative leads that
scientific research stands to gain the most from
PSI-BLAST.
In
the continuing search for protein families, BLAST has been altered yet
again. Pattern-Hit
Initiated (PHI) BLAST takes a different approach from its PSI
counterpart. Instead of formulating its own
matrix to generate high-scoring segment pairs (HSPs), PHI-BLAST limits the number of database proteins
considered by using pattern seeds.
Computationally more efficient than PSI-BLAST, PHI-BLAST narrows the
search by first scanning the database for proteins that contain
the pattern (usually a short residue sequence) and then limiting the alignment
to only those proteins.
PHI-BLAST operates under two
related premises. The first is that a
few amino acid residues of the query protein are of biological importance. The second is that important residues will be
conserved between related proteins.
Limiting the alignment search to those database proteins that contain
the pattern has a threefold effect: reducing background noise
in the search, revealing otherwise hidden relationships between proteins,
and implicating the pattern residues in roles beyond their laboratory-discovered
function.
In an example
described by Zhang et al. (1998), PHI-BLAST was employed to find
proteins with a structural relationship to the HS90-like ATPase domains of such
proteins as MutL (DNA repair), type II topoisomerases, histidine kinases, and of
course the HS90 proteins themselves. Four patterns
were used separately to probe the NCBI non-redundant protein sequence
database: [G/A]xxxxGK[S/T], hxhxDxGxG
(h=[ILVMF], hydrophobic residues), DhDhhh, and QxxGRx[G/A]R. Non-trivial hits included the C. elegans protein ZC155.3, a possible
ortholog of “Bovine synaptocanalin I.”
According to Zhang, “the synaptocanalin domain apparently was fused to
the worm protein by exon misassembly.”
A previously undescribed human protein, KIAA0136, and a plant
homolog were also non-trivial hits.
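The pattern-seeding step can be sketched in a few lines of Python. The example below translates the pattern notation used above (x for any residue, brackets for alternatives, h for [ILVMF]) into regular expressions and keeps only the database entries that contain a match. It is a rough illustration under stated assumptions: the two database sequences are invented, the function names pattern_to_regex and seed_hits are hypothetical, and the real PHI-BLAST goes on to build and score alignments around each pattern occurrence, which is not shown here.

```python
import re

HYDROPHOBIC = "[ILVMF]"  # the class written as 'h' in the patterns above


def pattern_to_regex(pattern):
    """Translate e.g. '[G/A]xxxxGK[S/T]' into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Bracketed alternatives: drop the '/' separators.
            end = pattern.index("]", i)
            parts.append("[" + pattern[i + 1:end].replace("/", "") + "]")
            i = end + 1
        else:
            if ch == "x":
                parts.append(".")          # any residue
            elif ch == "h":
                parts.append(HYDROPHOBIC)  # any hydrophobic residue
            else:
                parts.append(ch)           # a fixed residue
            i += 1
    return re.compile("".join(parts))


def seed_hits(pattern, database):
    """Map each sequence name to the start positions of pattern matches.

    Sequences without a match are left out, which is the filtering step.
    Note that finditer reports non-overlapping matches only.
    """
    regex = pattern_to_regex(pattern)
    hits = {}
    for name, sequence in database.items():
        starts = [match.start() for match in regex.finditer(sequence)]
        if starts:
            hits[name] = starts
    return hits


# Invented sequences for illustration only.
toy_database = {
    "toy1": "MSTAGPSGSGKSTLLK",
    "toy2": "MKVLAAGIDDEF",
}

for pattern in ["[G/A]xxxxGK[S/T]", "hxhxDxGxG", "DhDhhh", "QxxGRx[G/A]R"]:
    print(pattern, "->", pattern_to_regex(pattern).pattern)

hits = seed_hits("[G/A]xxxxGK[S/T]", toy_database)
print(hits)
```

Only toy1 survives the filter, so any subsequent alignment work would be confined to that sequence and anchored at the reported position.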
With the current and future
completions of genomes, both PSI-BLAST and PHI-BLAST will become increasingly
important as groundwork for, and an extension of, the work of molecular biologists and
biophysicists. As the current
technology within molecular biology will be no match for the imminent profusion
of genomic information, computers will have to replace, to a degree, the lab
bench. Given the logic
behind orthologs and protein superfamilies, demonstrating conserved domains and
patterns will provide a rich yet rapid understanding of protein
function, and so give added direction to
the molecular characterization of proteins.
List of Works Cited:
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ, Nucleic Acids Res 1997 Sep 1;25(17):3389-402
Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C, J Mol Biol 1998 Dec 11;284(4):1201-10
Zhang Z, Schaffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF, Nucleic Acids Res 1998 Sep 1;26(17):3986-90