Bioinformatics Final Project

MB&B 452a (Fall 1999)

Instructor: Mark Gerstein

Title: Assignment of protein functions from genome sequences

Suk-Jo Kang

The complete genome sequences of more than 20 organisms have driven rapid growth of sequence and related databases, yet a vast number of genes remain unannotated. For genome projects to be successful, there must be fast and reliable ways to identify the functions of unknown proteins. Experimentally, the yeast two-hybrid system, mass spectrometry combined with two-dimensional gel electrophoresis, and DNA microarray technology have been used to examine protein function and protein-protein interactions. However, computational methods are increasingly needed to cope with the rapidly growing data (Casari et al., 1995). Here, I review several novel approaches for the functional assignment of uncharacterized proteins.

Until now, sequence similarity searching has been the most popular and widely used way to discover protein function as well as structure. The function of a query protein can be deduced by comparing its amino-acid sequence with those of homologous proteins of known function. More recently, PSI-BLAST (Position-Specific Iterated BLAST) was developed by incorporating gapped alignment and profile analysis. PSI-BLAST is fast and highly sensitive, so it can detect proteins that are only weakly similar in sequence to the query but are still biologically meaningful relatives. The issues involved in sequence similarity searching have been reviewed thoroughly elsewhere (Altschul et al., 1994; Bork and Koonin, 1998). However, the limitations of predicting function by homology search are worth noting. By construction, the approach cannot assign a "novel" function to the query protein, and it cannot assign "any" function when no homolog of known function is found in the database. In addition, sequence identity does not always correspond to functional resemblance.

New methods that are only indirectly related to sequence similarity searching have been proposed to infer function. In these methods, proteins are classified into "functionally linked groups" according to the assumption on which each method is based.

The first approach groups proteins according to the organization of their genes on chromosomes. It is based on the observation that functionally related genes tend to lie closer to each other than unrelated ones do. The high degree of clustering among genes of the same functional class makes classification by conserved gene clusters biologically meaningful, and such clusters may be maintained by natural selection (Tamames et al., 1997). Furthermore, earlier studies showed that conserved gene order can be correlated with physical interactions between the encoded proteins (Dandekar et al., 1998). Conservation of the order or clustering of functionally coupled genes could therefore be used as a tool for predicting both protein-protein interactions and protein function. For example, suppose a gene cluster has been characterized in one or more genomes, and a newly identified gene in another organism lies within a similar cluster. The new protein can then be considered functionally coupled to the other members of its cluster, and its function can be inferred from the gene occupying the corresponding position in the characterized cluster. Overbeek et al. (1999) recently tested whether this approach can be used to assign functions to uncharacterized genes.
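
The idea of conserved gene proximity can be illustrated with a minimal sketch. It assumes each genome has already been reduced to an ordered list of ortholog-family labels (the labels and the `max_gap` parameter are hypothetical), and it ignores strand and chromosome circularity. It is a toy illustration of the principle, not the procedure used in the studies cited above.

```python
def close_pairs(gene_order, max_gap=1):
    """Return unordered pairs of genes lying within max_gap positions of each other."""
    pairs = set()
    for i, gene in enumerate(gene_order):
        # Look only at the next max_gap genes downstream of position i.
        for other in gene_order[i + 1 : i + 1 + max_gap]:
            if gene != other:
                pairs.add(frozenset((gene, other)))
    return pairs


def conserved_pairs(genome1, genome2, max_gap=1):
    """Pairs of gene families that are close together in both genomes."""
    return close_pairs(genome1, max_gap) & close_pairs(genome2, max_gap)


# Hypothetical gene orders, written as ortholog-family labels.
genome1 = ["famA", "famB", "famC", "famD"]
genome2 = ["famD", "famX", "famB", "famA"]

for pair in conserved_pairs(genome1, genome2):
    print(sorted(pair))
```

Pairs that stay adjacent in both gene orders (here famA and famB) are the candidates for functional coupling; with more genomes, the same intersection could be extended to require conservation in several of them.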

The second method, called phylogenetic profiling, groups proteins that participate in a common assembly or metabolic pathway by tracing their correlated evolution. The method assumes that the proteins in a functionally linked group evolve together, so that their homologs are found in the same subset of organisms. Proteins clustered by the similarity of their phylogenetic profiles should therefore be functionally related. Pellegrini et al. (1999) tested this hypothesis by grouping the proteins encoded by the E. coli genome on the basis of shared SwissProt keywords and building a phylogenetic profile for each of them from homology searches against other fully sequenced genomes. They showed that proteins whose phylogenetic profiles are similar to that of a query protein tend to be functionally linked with it. Conversely, groups of proteins known to be functionally linked often have similar phylogenetic profiles, or are at least neighbors in profile space. These results suggest that the function of an uncharacterized query protein can be linked to that of proteins with identical profiles. However, phylogenetic profiles are probably not free of the effects of lateral gene transfer on the phylogeny of prokaryotes (Doolittle, 1999a, b; see the technical comments of Huynen et al., 1999b, Stiller and Hall, 1999b, and Gupta and Soltys, 1999b on Doolittle, 1999a).
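
A phylogenetic profile can be sketched as a vector of presence/absence flags, one per genome. The example below is a minimal illustration under stated assumptions: the genome names, protein names, and homolog assignments are hypothetical, and homology detection itself is assumed to have been done beforehand. Proteins with identical profiles have a Hamming distance of zero; "neighbors in profile space" differ in only a few genomes.

```python
# Hypothetical list of fully sequenced genomes used as profile columns.
GENOMES = ["genomeA", "genomeB", "genomeC", "genomeD"]


def build_profile(homolog_genomes, genomes=GENOMES):
    """Return a tuple of 0/1 flags marking the genomes where a homolog was found."""
    return tuple(1 if g in homolog_genomes else 0 for g in genomes)


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def neighbors(query, profiles, max_distance=0):
    """Proteins whose profile differs from the query's in at most max_distance genomes."""
    query_profile = profiles[query]
    return sorted(
        name
        for name, profile in profiles.items()
        if name != query and hamming(query_profile, profile) <= max_distance
    )


# Hypothetical homology-search results: protein -> genomes containing a homolog.
hits = {
    "P1": {"genomeA", "genomeC"},
    "P2": {"genomeA", "genomeC"},
    "P3": {"genomeA", "genomeB", "genomeC"},
    "P4": {"genomeD"},
}
profiles = {name: build_profile(found) for name, found in hits.items()}

print(neighbors("P1", profiles))                  # identical profiles only
print(neighbors("P1", profiles, max_distance=1))  # allow one differing genome
```

In this toy data P2 shares the profile of P1 exactly, while P3 becomes a neighbor only when one mismatch is tolerated.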

The third method, called domain-fusion analysis, rests on the observation that two separate, interacting proteins in one organism are sometimes homologous to different parts of a single polypeptide chain in another organism, in which the two proteins are fused. The fused chain is called a Rosetta Stone protein. Searching for such links has the potential to predict protein pairs with related biological functions as well as physical protein-protein interactions (Enright et al., 1999; Marcotte et al., 1999a; Sali, 1999). However, there are several possible sources of false positives when predicting physical interactions. One is that fused domains need not interact, as with regulatory or signaling domains. Moreover, the analysis cannot discriminate among the interactions of proteins that share a common homology domain such as the SH2 domain. Although the authors argued that such "promiscuous" domains account for only ~5% of all domains, that figure should be weighed against the contribution of these domains to biological processes.
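
The core test of domain-fusion analysis can be sketched as follows. The sketch assumes that alignments have already been computed and are given as (component protein, start, end) regions on each candidate fused protein; all protein names and coordinates are hypothetical. Two component proteins are linked when they align to non-overlapping regions of the same fused chain. No filtering of promiscuous domains is attempted here.

```python
from itertools import combinations


def fusion_links(alignments):
    """Return pairs of component proteins that map to non-overlapping
    regions of the same composite (candidate Rosetta Stone) protein.

    alignments: dict mapping composite protein -> list of
                (component protein, start, end), with inclusive coordinates.
    """
    links = set()
    for hits in alignments.values():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(hits, 2):
            if a == b:
                continue
            # Two inclusive intervals overlap when each starts before the other ends.
            overlap = a_start <= b_end and b_start <= a_end
            if not overlap:
                links.add(frozenset((a, b)))
    return links


# Hypothetical alignment regions on two composite proteins.
alignments = {
    "fusedAB": [("geneA", 1, 200), ("geneB", 220, 450)],
    "other": [("geneC", 1, 100), ("geneD", 50, 150)],
}

for pair in fusion_links(alignments):
    print(sorted(pair))
```

Here geneA and geneB are linked through fusedAB, whereas geneC and geneD align to the same region of their composite and so are more likely homologs of each other than fusion partners.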

Gene clusters, phylogenetic profiles, and domain fusions are meaningful approaches because they provide new links between non-homologous proteins that cannot be obtained from traditional sequence-matching schemes. Whereas these methods depend on sequence homology directly or indirectly, two assignment schemes that are independent of sequence matching have also been presented.

mRNA expression pattern analysis groups genes by the similarity of their genome-wide expression patterns, measured by DNA microarray hybridization. In two successive papers, researchers obtained transcriptional time-course data for various responses of the budding yeast S. cerevisiae, including the mitotic cell division cycle, sporulation, the diauxic shift, and shock responses, and for the growth response of human fibroblasts to fetal bovine serum (Eisen et al., 1998; Iyer et al., 1999). They observed tight clustering of genes that are non-homologous in sequence but similar in function, and coordinated regulation of groups of genes whose products act at different steps of a common process. This analysis suggests a new way to discover unknown genes whose expression is regulated in specific temporal patterns. If coexpression data obtained under a variety of conditions are used to sort genes into functional sets, potentially linked functions can be assigned to the encoded proteins.
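
Grouping genes by expression similarity can be sketched with a plain Pearson correlation between time courses. This is a simplified stand-in, assuming a fixed correlation threshold rather than the full cluster analysis of the cited papers; the gene names, expression values, and the `threshold` parameter are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression time courses.

    Assumes neither series is constant (a constant series has zero variance
    and would cause a division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed(query, expression, threshold=0.9):
    """Genes whose time course correlates with the query's at or above threshold."""
    query_series = expression[query]
    return sorted(
        gene
        for gene, series in expression.items()
        if gene != query and pearson(query_series, series) >= threshold
    )


# Hypothetical expression measurements at four time points.
expression = {
    "g1": [0.0, 1.0, 2.0, 3.0],
    "g2": [0.1, 2.1, 4.1, 6.1],  # rises together with g1
    "g3": [3.0, 2.0, 1.0, 0.0],  # falls as g1 rises
    "g4": [1.0, 0.0, 1.0, 0.0],  # oscillates
}

print(coexpressed("g1", expression))
```

An uncharacterized gene that lands in the same coexpressed set as genes of known function becomes a candidate for a linked function, subject to confirmation under additional conditions.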

Protein function can be predicted more precisely once three-dimensional structures are determined. For finding functionally linked proteins, this approach has two advantages over the methods above: sequence similarity is more closely related to similarity in structure than to similarity in function, and the functional units of proteins are domains or modules, which are defined most precisely by structural analysis. Although there are many examples of proteins with different folds that share the same function, it is more practical to classify proteins into families according to similarity in their structural units, that is, folding domains. This approach can be automated in conjunction with a fold library (Gerstein and Hegyi, 1998; Hegyi and Gerstein, 1999). Its applicability can be enhanced by determining the function of each folding domain in the context of the overall structure, the amino-acid residues participating in the activity, or the environment in which the protein functions. This kind of refinement may be useful for identifying and characterizing the functions of multi-domain proteins.

Prediction of protein function will probably become more precise when the above methods are combined. In an early study, Tamames et al. (1996) clustered proteins from various organisms in two ways, by functional annotation and by sequence motifs, and then identified the sets common to both classifications. A recent study demonstrated high-confidence functional links among yeast proteins by grouping them according to phylogenetic profiles, mRNA expression patterns, and domain fusions (Marcotte et al., 1999b; Sali, 1999). As these studies show, a combined approach is powerful for narrowing down the classification and possibly pinpointing protein function. Sooner or later, a new interface or database can be expected to bring together sequence similarity, functional clustering, and structural analysis. (*Word counts for body text only: 1386)
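
One simple way to combine link predictions from several methods is to count how many methods support each protein pair and keep only the pairs with enough support. The sketch below is an assumption made for illustration, not the published combined algorithm; the protein names and the `min_support` parameter are hypothetical.

```python
from collections import Counter


def combine_links(*link_sets, min_support=2):
    """Keep protein pairs predicted by at least min_support independent methods.

    Each argument is a collection of frozenset pairs produced by one method.
    """
    support = Counter()
    for links in link_sets:
        # set() ensures a method votes at most once for any given pair.
        support.update(set(links))
    return {pair for pair, votes in support.items() if votes >= min_support}


# Hypothetical links from three methods.
profile_links = {frozenset(("p1", "p2")), frozenset(("p1", "p3"))}
fusion_pairs = {frozenset(("p2", "p1"))}
expression_links = {frozenset(("p1", "p3")), frozenset(("p4", "p5"))}

for pair in combine_links(profile_links, fusion_pairs, expression_links):
    print(sorted(pair))
```

Raising `min_support` trades coverage for confidence: fewer pairs survive, but each is backed by more independent lines of evidence.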

References:

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994) Issues in searching molecular sequence databases. Nature Genetics 6, 119-129.

Bork, P., and Koonin, E. V. (1998) Predicting functions from protein sequences: where are the bottlenecks? Nature Genetics 18, 313-318.

Casari, G., Andrade, M. A., Bork, P., Boyle, J., Daruvar, A., Ouzounis, C., Schneider, R., Tamames, J., Valencia, A., and Sander, C. (1995) Challenging times for bioinformatics. Nature 376, 647-648.

Dandekar, T., Snel, B., Huynen, M., and Bork, P. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324-328.

Doolittle, W. F. (1999a) Phylogenetic classification and the universal tree. Science 284, 2124-2128.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863-14868.

Enright, A. J., Iliopoulos, I., Kyrpides, N. C., and Ouzounis, C. A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86-89.

Gerstein, M., and Hegyi, H. (1998) Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol. Rev. 22, 277-304.

Hegyi, H., and Gerstein, M. (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147-164.

Huynen, M., Snel, B., and Bork, P.; Stiller, J. W., and Hall, B. D.; Gupta, R. S., and Soltys, B. J.; Doolittle, W. F. (1999b) Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Science 286, 1443a (technical comments).

Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P. O. (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283, 83-87.

Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O., and Eisenberg, D. (1999a) Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751-753.

Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., and Eisenberg, D. (1999b) A combined algorithm for genome-wide prediction of protein function. Nature 402, 83-86.

Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D., and Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA 96, 2896-2901.

Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96, 4285-4288.

Sali, A. (1999) Functional links between proteins. Nature 402, 23-26.

Tamames, J., Ouzounis, C., Sander, C., and Valencia, A. (1996) Genomes with distinct function composition. FEBS Letters 389, 96-101.

Tamames, J., Casari, G., Ouzounis, C., and Valencia, A. (1997) Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol. 44, 66-73.