Jesse Rinehart

Bioinformatics Final Paper

December 10, 1999

 

COGs and Comparative Genomics

 

Complete genome sequences have revolutionized our understanding of complex biological phenomena. In the last 3.5 years more than 20 complete genomes have been published, and many more are on the horizon. These genome sequences are changing the landscape of modern research in molecular biology, genetics, and biochemistry. The vast arrays of sequence data, both nucleotide and peptide, in conjunction with the Internet, have provided most researchers with the tools and information to conduct computational genomic analysis. These days, almost any functional study of a particular protein or cellular system can, and perhaps should, involve a comparative-genomics aspect. However, knowing the sequences of all the genes in multiple genomes raises the issue of gene classification. A natural framework is needed that classifies genes into groups suited to functional and evolutionary analysis and, ultimately, to automatic functional annotation of newly sequenced genes. This framework itself will grow and evolve with the addition of each newly sequenced genome. One solution, already put into practice, is founded on relationships between homologous gene families from different genomes (1). Tatusov et al. have assembled clusters of orthologous groups (COGs), each drawn from three or more phylogenetic lineages, to serve as a framework for identifying gene function.

Any two genes from different organisms that belong to the same COG are orthologs (1). Orthologs are derived from a common ancestor and have normally retained the same function over the course of evolution. Tatusov et al. used pairwise gapped BLAST (2) sequence alignments of 17,976 proteins from seven complete genomes (E. coli, H. influenzae, M. genitalium, M. pneumoniae, Synechocystis sp., M. jannaschii, and S. cerevisiae) and, for each protein, determined the best hit (BeT) in each of the other genomes. The BeTs were assembled into patterns, the simplest of which is a triangle of three orthologs (Figure 1A) (1). Complex COG formation consists of finding all triangles formed by BeTs and merging the ones with common sides (Figure 1B). Ultimately, the groups are assemblies of all the proteins from different lineages that are likely to be orthologs, and they thus define the COG classification. The asymmetry in the patterns is the result of paralogs from the same lineage (1). The COG analysis is far from complete because of biases in the published genomes and lack of coverage, but some interesting trends are already evident, and additional genomes will enrich the existing COG data.
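The triangle-and-merge procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by Tatusov et al.: the function name build_cogs, the protein labels (eco_A, hin_A, and so on), and the toy best-hit table are all hypothetical, and the sketch makes the simplifying assumption that a BeT in either direction links two proteins. Real BeTs would come from the all-against-all BLAST comparison.

```python
from itertools import combinations


def build_cogs(best_hits, genome_of):
    """Group proteins into COG-like clusters from best-hit (BeT) data.

    best_hits: dict mapping each protein to the set of proteins that are
               its best hits in other genomes (hypothetical input format).
    genome_of: dict mapping each protein to the genome it comes from.
    """
    # Step 1: treat every BeT between two genomes as an undirected link
    # (a simplifying assumption of this sketch).
    edges = set()
    for query, hits in best_hits.items():
        for hit in hits:
            if genome_of[query] != genome_of[hit]:
                edges.add(frozenset((query, hit)))

    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Step 2: find all triangles whose corners lie in three different genomes.
    triangles = set()
    for edge in edges:
        a, b = tuple(edge)
        for c in neighbours[a] & neighbours[b]:
            if len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                triangles.add(frozenset((a, b, c)))
    triangles = sorted(triangles, key=sorted)

    # Step 3: merge triangles that share a side, using union-find.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    side_owner = {}
    for idx, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(idx)] = find(side_owner[side])
            else:
                side_owner[side] = idx

    # Step 4: collect the proteins of each merged group.
    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())


# Toy data (made up for illustration only).
genome_of = {
    "eco_A": "E. coli", "hin_A": "H. influenzae",
    "sce_A": "S. cerevisiae", "syn_A": "Synechocystis sp.",
    "eco_B": "E. coli", "hin_B": "H. influenzae",
}
best_hits = {
    "eco_A": {"hin_A", "sce_A", "syn_A"},
    "hin_A": {"eco_A", "syn_A"},
    "sce_A": {"eco_A", "syn_A"},
    "syn_A": {"eco_A"},
    "eco_B": {"hin_B"},
    "hin_B": {"eco_B"},
}
cogs = build_cogs(best_hits, genome_of)
print(cogs)
```

In the toy data, two triangles share the side eco_A/syn_A and are merged into a single four-member group, while the pair eco_B/hin_B forms no triangle and is therefore left out of any group.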

 

 Go to http://pantheon.yale.edu/~jjr25/binf.html for figures

Figure 1: A) Congruent BeTs form a triangle, the minimal COG. Proteins are KatG E. coli, YKR066c S. cerevisiae, and sII1978 Synechocystis sp. (Adapted from Tatusov, R. L. et al. 1997).

 

Several features were evident from the COG patterns that emerged from the analysis of the seven genomes. Patterns can be identified by noting which of the species analyzed are represented in each COG, from at least three (the minimal COG) up to all seven. The 114 ubiquitous COGs (representing all of the species), most of which were translation and transcription components, possibly represent the core requirements of life. Of greater interest are the unusual COGs whose patterns occur only once. These imply unique functions that are scattered over the tree of life (1). The issue of coverage must be noted here. As more genomes are sequenced and the "COG space" evolves, there are likely to be more Archaeal-, Bacterial-, and Eukaryotic-only COGs, and the percentage of COGs that represent all forms of life will drop. Currently, comparative genomics based on the COGs (which can be accessed at http://www.ncbi.nlm.nih.gov/COG) has been useful in describing complex evolutionary trends in gene clusters. The COG patterns that occur only once in the analysis by Tatusov et al. can be examined by further sequence analysis and may be explained by mechanisms of gene loss and horizontal transfer (3).
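The idea of a COG pattern can be made concrete with a short Python sketch. It assumes each COG is available as a set of the genomes it contains. The two-letter genome abbreviations, the function names, and the toy COG identifiers are hypothetical and chosen only for illustration; they are not real COG data.

```python
from collections import Counter

# The seven genomes from the analysis, abbreviated (abbreviations are
# an assumption of this sketch): E. coli, H. influenzae, M. genitalium,
# M. pneumoniae, Synechocystis sp., M. jannaschii, S. cerevisiae.
GENOMES = ["Ec", "Hi", "Mg", "Mp", "Sy", "Mj", "Sc"]


def phyletic_pattern(cog_genomes, genomes=GENOMES):
    """Return a +/- string marking which genomes are represented in a COG."""
    return "".join("+" if g in cog_genomes else "-" for g in genomes)


def summarize_patterns(cogs, genomes=GENOMES):
    """Compute each COG's pattern, the ubiquitous COGs, and the COGs
    whose pattern occurs only once in the data set."""
    patterns = {cog_id: phyletic_pattern(members, genomes)
                for cog_id, members in cogs.items()}
    counts = Counter(patterns.values())
    ubiquitous = sorted(c for c, p in patterns.items() if "-" not in p)
    unique = sorted(c for c, p in patterns.items() if counts[p] == 1)
    return patterns, ubiquitous, unique


# Toy COGs (made up for illustration only).
toy_cogs = {
    "toy1": {"Ec", "Hi", "Mg", "Mp", "Sy", "Mj", "Sc"},
    "toy2": {"Ec", "Hi", "Mg", "Mp", "Sy", "Mj", "Sc"},
    "toy3": {"Ec", "Hi", "Sy"},
    "toy4": {"Ec", "Hi", "Sy"},
    "toy5": {"Sy", "Mj", "Sc"},
}
patterns, ubiquitous, unique = summarize_patterns(toy_cogs)
print(patterns["toy5"], ubiquitous, unique)
```

Here toy1 and toy2 play the role of ubiquitous COGs, while toy5 is the only COG with its pattern and would be a candidate for closer inspection.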

Additional genome sequences and the application of the more sophisticated PSI-BLAST (2) have recently revealed a specific example of horizontal gene transfer. These findings have supported and enriched COG analysis. Figure 2 summarizes results from a recent study that shows a greater prevalence of Aquifex aeolicus genes in archaeal COGs (3). PSI-BLAST, which greatly increases sensitivity to weak but biologically relevant sequence relationships (2), was used to show that the Aquifex genome in fact contains several clusters of archaeal genes (3). Thus, COG analysis first identified a relationship, and further analysis then strongly suggested a horizontal transfer of genetic information. This gene transfer is relevant because Aquifex is a hyperthermophilic bacterium that occupies a biological niche dominated by the archaea.

The COGs possibly represent a melding of comparative genomics and protein classification. Currently there are numerous ways to annotate sequence data, but the COGs seem to be a unique example of a natural system of classification that uses the descendants of an ancestral gene as its basic unit. Each unit is then associated with a conserved function, so that the inclusion of a protein in a COG annotates its function (1). The results of COG analysis have already extended into more specific studies. Because the COGs can change as genomic information is added, their evolution should only improve protein classification in newly sequenced organisms.

 

Figure 2: (Adapted from Aravind, L. et al. 1998)

 

  

 

1) Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science 1997 Oct 24;278(5338):631-7.

2) Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402.

3) Aravind L, Tatusov RL, Wolf YI, Walker DR, Koonin EV. Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet 1998 Nov;14(11):442-4.