GeneCensus Comparison of Fold Usage in Worm vs. Yeast

Information Page Specific to Worm-Yeast Fold Comparison in Chervitz et al.

The general spirit of the analysis closely followed that of recent comparisons of microbial genomes (Gerstein & Levitt, 1997; Gerstein, 1997; Gerstein, 1998a; Gerstein, 1998b).

Methods

Genome Data: Translated genome sequences were taken from the carefully designed web site of Chervitz et al. (1998). The genome data are constantly changing and are contingent on the current "state of the art" in gene finding. The data used here reflect a particular snapshot of this ongoing process. In particular, the worm ORF file used by Chervitz et al. is considerably larger than the current wormpep file (wormpep 16). The larger Chervitz et al. file was used for the reported analysis, but all the calculations were also repeated using just the proteins in wormpep 16. (These are the CE-wp16 results.) The 3557 "worm-only" ORFs were taken from a list prepared by Chervitz et al.
Structure Data: Structures were taken from the PDB via the PDB browser (Abola et al., 1997; Stampf et al., 1995). Domain fold and class definitions were taken from scop version 1.35, May 1996 (Brenner et al., 1996; Murzin et al., 1995; Hubbard et al., 1997). Core structures for each domain were based on refinement of structural alignments (Altman & Gerstein, 1994; Gerstein & Altman, 1995; Gerstein & Levitt, 1996, 1997). The structures in the PDB were clustered into 1135 representative domains. The clustering was similar in spirit to the many previous divisions of the PDB into representative chains (e.g. Hobohm & Sander, 1992, 1994; Brenner, 1997, 1998; Boberg et al., 1992). However, a slightly different multiple-linkage algorithm was used (Kaufman & Rousseeuw, 1990). It was designed to be internally consistent with the search method used to identify homologues in the genomes, using the same similarity criterion (a FASTA e-value threshold).
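The idea of grouping domains by a pairwise e-value criterion can be illustrated with a minimal sketch. Note that it uses plain single linkage (any one significant pairwise match joins two clusters), which is a simplification swapped in for illustration and not the multiple-linkage algorithm used here. The domain identifiers, the e-values, the reuse of the 0.01 cutoff, and the choice of the alphabetically first member as representative are all assumptions.

```python
from collections import defaultdict

E_VALUE_THRESHOLD = 0.01  # assumed: same cutoff as the genome searches


def cluster_domains(domains, pair_evalues, threshold=E_VALUE_THRESHOLD):
    """Group domains linked by any pairwise e-value at or below threshold.

    This is single linkage, a simplification of the multiple-linkage
    method described in the text.
    """
    parent = {d: d for d in domains}

    def find(d):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), evalue in pair_evalues.items():
        if evalue <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(list)
    for d in domains:
        clusters[find(d)].append(d)
    # Representative (an assumption): the alphabetically first member.
    return {sorted(m)[0]: sorted(m) for m in clusters.values()}


# Hypothetical domain identifiers and e-values, for illustration only.
domains = ["d1", "d2", "d3", "d4"]
pair_evalues = {("d1", "d2"): 1e-10, ("d2", "d3"): 0.005, ("d3", "d4"): 0.5}
representatives = cluster_domains(domains, pair_evalues)
```

In this toy input d1, d2 and d3 end up in one cluster because each adjacent pair passes the cutoff, while d4 remains on its own. A stricter linkage rule would require more than one significant match before merging.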
Database: All information was put into a database analysis system, with tables cross-referencing sequence identifiers, structure matches, and so forth, and cross-tabulation reports giving the occurrence of various patterns. The database has been described previously (Gerstein & Levitt, 1997; Gerstein, 1997; Gerstein, 1998a; Gerstein, 1998b). It is implemented using DBM, Perl5 (Wall et al., 1996) and the Informix database system. Most of the tables in the database are accessible from this web site in a simple tab-delimited form. The tables are structured in such a way that all the genome features (e.g. location of a TM-helix or PDB match) are annotated in a consistent fashion.
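As a rough illustration of how a tab-delimited table of this kind could be read and cross-tabulated, here is a minimal Python sketch. The column names (orf_id, genome, feature) and the rows are hypothetical and do not reflect the actual table layout, which was implemented in Perl and Informix.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited table; column names and rows are illustrative only.
TABLE = (
    "orf_id\tgenome\tfeature\n"
    "orf1\tSC\tPDB-match\n"
    "orf2\tSC\tTM-helix\n"
    "orf3\tCE\tPDB-match\n"
    "orf4\tCE\tPDB-match\n"
)


def cross_tabulate(text):
    """Count how often each (genome, feature) pair occurs in the table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return Counter((row["genome"], row["feature"]) for row in reader)


counts = cross_tabulate(TABLE)
for (genome, feature), n in sorted(counts.items()):
    print(f"{genome}\t{feature}\t{n}")
```

Because every feature is annotated in the same row format, one counting routine can serve for any feature type.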
Sequence Comparison: All sequence comparisons were done with the FASTA program (version 2.0) (Lipman & Pearson, 1985; Pearson & Lipman, 1988) with ktup 1 and an e-value threshold of 0.01. This provides a level of sensitivity similar to that of the WU-Blast program used by Chervitz et al. (1998).
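Applying the e-value threshold amounts to a simple filter over the search hits, sketched below. The hit tuples (ORF identifier, structure domain identifier, e-value) are a hypothetical in-memory form and not the FASTA output format. Treating the cutoff as inclusive is also an assumption.

```python
E_VALUE_THRESHOLD = 0.01  # threshold stated in the text


def significant_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Keep hits whose e-value is at or below the threshold.

    Whether the cutoff is inclusive is an assumption of this sketch.
    """
    return [hit for hit in hits if hit[2] <= threshold]


# Hypothetical hits: (orf_id, domain_id, e-value).
hits = [
    ("orfA", "dom1", 1e-20),
    ("orfA", "dom2", 0.5),
    ("orfB", "dom3", 0.02),
    ("orfC", "dom1", 0.009),
]
kept = significant_hits(hits)
```

Here only the orfA/dom1 and orfC/dom1 hits survive, and orfB is left without a structure match.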

Explanation of Results

Table Related to Fold Usage in Different Genomes: The table shows the usage of each of the known folds in eight different genomes. The entire table is available over the web. Column "class" is the structural class that the fold belongs to, as determined by scop (Murzin et al., 1995). Column "obj_id" is the fold number in scop 1.35. Columns with two-letter species names (e.g. "EC", "SC") give the total number of matches in one genome for a particular fold. CE stands for C. elegans and CO for the worm-only part of CE. The rest of the genome names are defined on the GeneCensus browser page. Column "N_minsp" gives the number of sequence families with a particular fold in the PDB. This column is used to determine whether or not a fold is a superfold (in the top 25 in terms of the number of sequence families), and the superfolds are highlighted by inverted boxes. Column "N_scop" gives the number of times a particular fold occurs in the PDB, i.e. how many structures have been solved with this fold. Column "bestrep" gives a representative structure with this fold, including residue selection. (In the table, "dom" is used as an abbreviation for domain, "Nt-dom" for N-terminal domain, and "Ct-dom" for C-terminal domain.)
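The superfold rule described above (rank folds by N_minsp and take the top 25) can be sketched as follows. The rows, fold identifiers and counts are hypothetical. Breaking ties by input order is an assumption, since the text does not say how ties are handled.

```python
def superfolds(rows, top_n=25):
    """Return the obj_id values of the top_n folds ranked by N_minsp.

    Ties keep their input order (an assumption of this sketch).
    """
    ranked = sorted(rows, key=lambda r: r["N_minsp"], reverse=True)
    return {r["obj_id"] for r in ranked[:top_n]}


def total_matches(rows, genome):
    """Sum the per-fold match counts in one genome column."""
    return sum(r[genome] for r in rows)


# Hypothetical rows using the column names from the table description.
rows = [
    {"obj_id": "f1", "N_minsp": 12, "SC": 30, "CE": 45},
    {"obj_id": "f2", "N_minsp": 3, "SC": 5, "CE": 0},
    {"obj_id": "f3", "N_minsp": 8, "SC": 0, "CE": 20},
]
top_two = superfolds(rows, top_n=2)  # top_n=2 only because the toy table is small
sc_total = total_matches(rows, "SC")
```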
Venn Diagram: A Venn diagram showing the number of folds in each genome and how many of these folds are shared between different genomes.
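The counts in such a diagram follow from set operations on the folds found in each genome. The sketch below counts the folds in each exclusive region, i.e. those present in exactly a given combination of genomes. The fold identifiers and their assignment to genomes are hypothetical.

```python
from collections import Counter


def venn_regions(folds_by_genome):
    """Count folds in each exclusive Venn region.

    Each key of the result is the frozenset of genomes that contain a fold;
    the value is how many folds have exactly that membership.
    """
    all_folds = set().union(*folds_by_genome.values())
    regions = Counter()
    for fold in all_folds:
        members = frozenset(
            g for g, folds in folds_by_genome.items() if fold in folds
        )
        regions[members] += 1
    return regions


# Hypothetical fold sets for three genomes.
folds_by_genome = {
    "SC": {"f1", "f2", "f4"},
    "CE": {"f1", "f3", "f4"},
    "EC": {"f1", "f2"},
}
regions = venn_regions(folds_by_genome)
```

The number of folds shared by two genomes regardless of the third is the sum of the regions that contain both, or equivalently the size of the intersection of the two sets.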

References

Abola, E. E., Sussman, J. L., Prilusky, J. & Manning, N. O. (1997). Protein Data Bank archives of three-dimensional macromolecular structures. Meth. Enz. 277, 556-571.

Altman, R. & Gerstein, M. (1994). Finding an Average Core Structure: Application to the Globins. In Proceedings of the Second International Conference on Intelligent Systems in Molecular Biology (pp. 19-27, AAAI Press, Menlo Park, CA).

Boberg, J., Salakoski, T. & Vihinen, M. (1992). Selection of a representative set of structures from Brookhaven Protein Data Bank. Proteins 14, 265-76.

Brenner, S., Chothia, C. & Hubbard, T. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95, 6073-6078.

Brenner, S., Chothia, C., Hubbard, T. J. P. & Murzin, A. G. (1996). Understanding Protein Structure: Using Scop for Fold Interpretation. Meth. Enz. 266, 635-642.

Chervitz, S. A., Aravind, L., Sherlock, G., Ball, C. A., Koonin, E. V., Dwight, S. S., Harris, M. A., Dolinski, K., Mohr, S., Smith, T., Weng, S., Cherry, J. M. & Botstein, D. (1998). Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282, 2022-8.

The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-8.

Gerstein, M. (1997). A Structural Census of Genomes: Comparing Eukaryotic, Bacterial and Archaeal Genomes in terms of Protein Structure. J. Mol. Biol. 274, 562-576.

Gerstein, M. (1998a). How Representative are the Known Structures of the Proteins in a Complete Genome? A Comprehensive Structural Census. Folding & Design 3, 497-512.

Gerstein, M. (1998b). Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census. Proteins 33, 518-534.

Gerstein, M. & Altman, R. (1995). Average core structures and variability measures for protein families: Application to the immunoglobulins. J. Mol. Biol. 251, 161-175.

Gerstein, M. & Levitt, M. (1996). Using Iterative Dynamic Programming to Obtain Accurate Pairwise and Multiple Alignments of Protein Structures. In Proc. Fourth Int. Conf. on Intell. Sys. Mol. Biol. (pp. 59-67, AAAI Press, Menlo Park, CA).

Gerstein, M. & Levitt, M. (1997). A Structural Census of the Current Population of Protein Sequences. Proc. Natl. Acad. Sci. USA 94, 11911-11916.

Gerstein, M. & Levitt, M. (1998). Comprehensive Assessment of Automatic Structural Alignment against a Manual Standard, the Scop Classification of Proteins. Protein Science 7, 445-456.

Hobohm, W., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Prot. Sci. 1, 409-417.

Hubbard, T. J. P., Murzin, A. G., Brenner, S. E. & Chothia, C. (1997). SCOP: a structural classification of proteins database. Nucleic Acids Res 25, 236-9.

Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York.

Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.

Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: A Structural Classification of Proteins for the Investigation of Sequences and Structures. J. Mol. Biol. 247, 536-540.

Pearson, W. R. & Lipman, D. J. (1988). Improved Tools for Biological Sequence Analysis. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

Stampf, D. R., Felder, C. E. & Sussman, J. L. (1995). PDBbrowse--a graphics interface to the Brookhaven Protein Data Bank. Nature 374, 572-4.

Wall, L., Christiansen, D. & Schwartz, R. (1996). Programming Perl. O'Reilly and Associates, Sebastopol, CA.
