Nature, 4 November 1999
News and Views
Nature 402, 23 - 26 (1999) © Macmillan Publishers Ltd.

Genomics: Functional links between proteins

ANDREJ SALI

Genome-sequencing projects have accomplished a monumental feat in generating complete lists of the proteins that make up multi-subunit assemblies and signalling pathways in various organisms. These assemblies and pathways must now be mapped, and papers by Marcotte et al.1 and Enright et al.2 (pages 83 and 86 of this issue) take a significant step in this direction. The two groups have developed computational methods that associate proteins through properties other than the similarity between their amino-acid sequences. By comparing phylogenetic (evolutionary) profiles and expression patterns, as well as by analysing domain fusions, the new methods identify proteins that are functionally linked through a metabolic pathway, a signalling pathway or a structural complex (Fig. 1, overleaf). About half of the uncharacterized proteins in yeast -- roughly a quarter of all yeast proteins -- may be partially annotated by these methods1. And, because they do not rely on direct sequence similarity, the methods can assign functions to proteins that lack detectable homologues of known function. They will find many applications in genomics, complementing experiments for the large-scale identification of protein function.

Figure 1 Functional annotation by computation.

The critical information needed to construct useful models of pathways and assemblies is provided by experiment, most importantly by proteomics and structural biology. Proteomics aims to identify and characterize complete sets of proteins and protein-protein interactions. It involves large-scale approaches such as the yeast two-hybrid system, mass spectrometry, two-dimensional gel electrophoresis and DNA-microarray hybridization3. The size and complexity of the task can be appreciated by assuming between five and 50 functional links for each of the roughly 6,000 yeast proteins, which gives 30,000 to 300,000 links for a single yeast cell. Although experiments have characterized about 30% of yeast proteins, they are not always rapid, inexpensive or complete enough. So there is a need to assign function using computational methods.

Computational methods have traditionally assigned function by sequence similarity to a characterized protein4,5. Such annotation is possible because evolution produced families of homologous proteins that share a common ancestor, and thus have similar sequences, structures and, often, functions. Protein comparisons have allowed some insight into the function of another 30% of yeast proteins6. However, functional assignment by homology is limited by two factors. First, it can be applied only to proteins with detectable homologues of known function. Second, it is not always clear what functional properties are shared by the matched proteins, especially for the more distant matches.

The new methods developed by Marcotte et al.1 and Enright et al.2 are not subject to these limitations because they do not depend on sequence similarity between uncharacterized proteins and proteins of known function. Instead, they group proteins that are part of the same pathway or assembly (Fig. 1) and define them as being 'functionally linked'. Marcotte et al. have applied three different classification schemes to the proteins in the budding-yeast genome: phylogenetic profiles7, domain-fusion analysis8 and correlated messenger RNA expression patterns9. The domain-fusion analysis was developed independently by Enright et al., using a new clustering algorithm, and applied to three prokaryotic genomes.

Phylogenetic profiling relies on the correlated evolution of proteins1,7. The evolution of two proteins is correlated when they share a phylogenetic profile, which is defined as the pattern of a protein's occurrence over a set of genomes10. The phylogenetic profile can be calculated rigorously only when several complete genomes are compared. Two proteins that share a similar phylogenetic profile are expected to be functionally linked. So, clustering of proteins based on their phylogenetic profiles can provide information about the function of an uncharacterized protein that is grouped with one or more functionally defined proteins.
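
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the protein names, genome names, occurrence data and helper functions are all hypothetical, and counting mismatches between profiles is one simple way to compare them, not necessarily the measure used by Marcotte et al.

```python
from itertools import combinations

# Hypothetical data: for each protein, the genomes that contain a homologue.
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D", "genome_E"]
OCCURRENCE = {
    "P1": {"genome_A", "genome_C", "genome_D"},
    "P2": {"genome_A", "genome_C", "genome_D"},
    "P3": {"genome_B", "genome_E"},
    "P4": {"genome_A", "genome_C", "genome_D", "genome_E"},
}


def profile(protein):
    """Return the presence/absence pattern of a protein over GENOMES."""
    present = OCCURRENCE[protein]
    return tuple(1 if genome in present else 0 for genome in GENOMES)


def mismatches(p, q):
    """Count the genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(max_mismatches=0):
    """Pairs of proteins whose profiles differ in at most max_mismatches genomes."""
    profiles = {name: profile(name) for name in OCCURRENCE}
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if mismatches(profiles[a], profiles[b]) <= max_mismatches
    ]


print(linked_pairs())   # proteins with identical profiles
print(linked_pairs(1))  # proteins whose profiles differ in at most one genome
```

In this toy data set P1 and P2 share a profile exactly, so an uncharacterized P2 would inherit a tentative functional link to P1. Allowing one mismatch also pulls in P4.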

The domain-fusion analysis identifies fusion proteins consisting of two non-homologous component proteins found separately in another genome1,2. Such component proteins are expected to interact physically with each other. An interface between two interacting component proteins is more likely to evolve when the proteins are fused in a single chain. A well-known example is tryptophan synthetase, whose alpha and beta subunits are separate proteins in bacteria but fused into a single chain in fungi. In some respects, the domain-fusion analysis is similar to the use of gene clusters for inferring functional links from gene proximity11,12.
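
A minimal sketch of the fusion test follows, assuming that sequence-similarity hits between query proteins and reference proteins are already available. The hit list, the protein names, the set of homologous pairs and the function names are hypothetical. The published methods apply stricter criteria than this simple non-overlap rule.

```python
from itertools import combinations

# Hypothetical similarity hits: (query_protein, reference_protein, start, end),
# where start and end delimit the matched region on the reference protein.
HITS = [
    ("trpA_like", "fused_X", 1, 250),
    ("trpB_like", "fused_X", 270, 650),
    ("q3", "ref_Y", 1, 300),
    ("q4", "ref_Y", 20, 310),  # covers the same region as q3: not a fusion pattern
]

# Hypothetical query pairs that are homologous to each other. They are
# skipped because the two components must be non-homologous.
HOMOLOGOUS = {frozenset(("q3", "q4"))}


def overlap(a_start, a_end, b_start, b_end):
    """Number of residues shared by two closed intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def fusion_links(hits, homologous, max_overlap=0):
    """Return (query_1, query_2, reference) triples that suggest a fusion."""
    by_reference = {}
    for query, reference, start, end in hits:
        by_reference.setdefault(reference, []).append((query, start, end))

    links = set()
    for reference, regions in by_reference.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2 or frozenset((q1, q2)) in homologous:
                continue
            # The two queries must match different parts of the reference.
            if overlap(s1, e1, s2, e2) <= max_overlap:
                first, second = sorted((q1, q2))
                links.add((first, second, reference))
    return sorted(links)


print(fusion_links(HITS, HOMOLOGOUS))
```

Here the two hypothetical query proteins match separate regions of `fused_X`, so they are reported as a candidate functionally linked pair. The pair matching `ref_Y` is rejected because both hits cover the same region.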

Marcotte et al. have also grouped yeast proteins by correlating their mRNA expression patterns1,9. These patterns were obtained from 97 publicly available DNA chip data sets, which indicate how the expression levels of most yeast proteins change during normal growth, glucose starvation, sporulation and expression of mutant genes. The analysis is based on the expectation that proteins with correlated expression levels over the same series of conditions are functionally linked.
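
As a rough illustration, correlated expression can be detected by computing a correlation coefficient between expression vectors measured over the same conditions. The sketch below uses the Pearson coefficient and a fixed threshold, both of which are assumptions made for the example. The gene names and expression values are invented.

```python
from itertools import combinations
from math import sqrt

# Hypothetical expression levels measured over the same five conditions.
EXPRESSION = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 3.0, 1.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return covariance / (spread_x * spread_y)


def coexpressed_pairs(expression, threshold=0.9):
    """Pairs of genes whose expression correlates at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(expression), 2)
        if pearson(expression[a], expression[b]) >= threshold
    ]


print(coexpressed_pairs(EXPRESSION))
```

Only the pair whose levels rise together across the conditions passes the threshold. The anticorrelated and uncorrelated genes are left unlinked.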

The new functional annotations are often broad, pinning a protein's function down to, say, 'metabolism' or 'transcription'. Even a random pair of proteins has a 50% chance of similar function at such a broad level1. But because the annotations are generally derived from a number of linkages, they are three to eight times more informative than random links, which is comparable, in the best case, to experimental determination of protein-protein interactions1. For example, Marcotte et al. established new links for MSH6, a DNA-mismatch-repair protein involved in some colorectal cancers. It is linked to the PMS1 mismatch-repair family (mutations in which are also tied to human colorectal cancer), the purine-biosynthetic pathway, two RNA-modification enzymes and an uncharacterized protein family. All of these may now be investigated in the light of nucleic-acid repair or modification.
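
One simple way to turn several linkages into a broad annotation is a majority vote over the known functions of a protein's linked partners. The sketch below illustrates that idea only. The voting rule, the protein names and the data are hypothetical and are not taken from Marcotte et al.

```python
from collections import Counter

# Hypothetical functional links and known broad annotations.
LINKS = [("u1", "k1"), ("u1", "k2"), ("u1", "k3"), ("u2", "k3")]
KNOWN = {"k1": "metabolism", "k2": "metabolism", "k3": "transcription"}


def annotate(protein, links, known):
    """Return (category, fraction of votes) for a protein, or None if no
    linked partner has a known function."""
    votes = Counter()
    for a, b in links:
        if protein in (a, b):
            partner = b if a == protein else a
            if partner in known:
                votes[known[partner]] += 1
    if not votes:
        return None
    category, count = votes.most_common(1)[0]
    return category, count / sum(votes.values())


print(annotate("u1", LINKS, KNOWN))
print(annotate("u2", LINKS, KNOWN))
```

The returned fraction gives a crude sense of how consistent the linked partners are, which matters because a single link to a broad category is barely better than chance.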

How accurate are the annotations, and what percentage of proteins can they cover? These questions can be only partially addressed, because a reference set of functionally linked proteins is not easily available. Marcotte and colleagues assigned a general function to about half of the 2,557 uncharacterized proteins in yeast13. They estimated that up to 30% of the pairwise predictions contributing to functional assignments were false, although for the subset of predictions made by two or three of the methods together the rate of false positives decreased to 15%.
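
The gain from combining methods can be illustrated by keeping only those links that are predicted independently by at least two methods. In this sketch the method names mirror the three approaches described above, but the predicted pairs and the function name are hypothetical.

```python
# Hypothetical pairwise predictions from each method.
PREDICTIONS = {
    "phylogenetic_profile": {("A", "B"), ("A", "C")},
    "domain_fusion": {("B", "A"), ("C", "D")},
    "expression": {("A", "B"), ("C", "D"), ("E", "F")},
}


def supported_links(predictions, min_methods=2):
    """Map each protein pair to the methods predicting it, keeping only
    pairs supported by at least min_methods methods."""
    support = {}
    for method, pairs in predictions.items():
        for a, b in pairs:
            # A link is unordered, so (A, B) and (B, A) count as the same pair.
            support.setdefault(frozenset((a, b)), set()).add(method)
    return {
        tuple(sorted(pair)): sorted(methods)
        for pair, methods in support.items()
        if len(methods) >= min_methods
    }


print(supported_links(PREDICTIONS))
```

Raising `min_methods` trades coverage for confidence: fewer links survive, but each surviving link has independent support.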

Enright et al. functionally linked only 215 proteins in three prokaryotic genomes by their domain-fusion analysis, but with very few estimated false positives. Their lower rate of functional linking seems to have three causes: they did not use the phylogenetic profiling and mRNA expression methods, and so lack the links that these provide; they used a stricter definition of fusion events; and they used fewer proteins to detect fusion. Despite false positives and coarse functional annotation, the computational methods allow experimentalists to concentrate on promising interactions. As more genome data become available, the number and accuracy of predictions from both the domain-fusion analysis and the phylogenetic profiling methods will increase.

The next step is to improve the coverage, accuracy and precision of methods for predicting protein function. This could, in theory, be done by considering three-dimensional structures, because a protein's function is determined more directly by its structure and dynamics than by its sequence. So why have structures not been used as widely as sequences in genomics? There are at least two reasons. First, three-dimensional structures are available for only a fraction of proteins. But this limitation should be reduced by structural genomics within a few years14. Structural genomics aims to determine the structures of around 10,000 carefully chosen protein domains, such that all other protein sequences can be modelled with useful accuracy15. Second, functional details that can be extracted from structure but not from sequence often depend on the details of that structure in the cellular environment, as well as on its dynamics and energetics, all of which are difficult to obtain by existing experimental and theoretical techniques.

As well as making predictions, bioinformaticians are facing the more practical challenge of making others aware of their predictions. Prediction methods need to be evaluated rigorously and made accessible over the internet16. Moreover, varied experimental data and theoretical predictions must be integrated, because no single experimental or computational approach is likely to result in accurate and complete models of protein assemblies and pathways. The latest computational methods for mapping functional links should make a big contribution to such models.


Andrej Sali is at the Laboratories of Molecular Biophysics, Pels Family Center for Biochemistry and Structural Biology, The Rockefeller University, 1230 York Avenue, New York, New York 10021, USA.
e-mail: sali@rockefeller.edu

------------------

References

  1. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. & Eisenberg, D. Nature 402, 83-86 (1999).
  2. Enright, A., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Nature 402, 86-90 (1999).
  3. Mendelsohn, A. R. & Brent, R. Science 284, 1948-1950 (1999).
  4. Koonin, E. V., Tatusov, R. L. & Galperin, M. Y. Curr. Opin. Struct. Biol. 3, 355-363 (1998).
  5. Bork, P. & Koonin, E. V. Nature Genet. 18, 313-318 (1998).
  6. Chervitz, S. A. et al. Nucleic Acids Res. 27, 74-78 (1999).
  7. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. Proc. Natl Acad. Sci. USA 96, 4285-4288 (1999).
  8. Marcotte, E. M. et al. Science 285, 751-753 (1999).
  9. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Proc. Natl Acad. Sci. USA 95, 14863-14868 (1998).
  10. Gaasterland, T. & Ragan, M. J. Microb. Comp. Genomics 3, 305-312 (1998).
  11. Dandekar, T. et al. Trends Biochem. Sci. 23, 324-328 (1998).
  12. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. & Maltsev, N. Proc. Natl Acad. Sci. USA 96, 2896-2901 (1999).
  13. Mewes, H. W. Nucleic Acids Res. 27, 44-48 (1999).
  14. Burley, S. K. et al. Nature Genet. 23, 151-157 (1999).
  15. Sánchez, R. & Sali, A. Proc. Natl Acad. Sci. USA 95, 13597-13602 (1998).
  16. Brenner, S. E., Barken, D. & Levitt, M. Nucleic Acids Res. 27, 251-253 (1999).


