Information Page Specific to Worm-Yeast
Fold Comparison in Chervitz et al.
The general spirit of the analysis closely followed that of
recent comparisons of microbial genomes (Gerstein & Levitt,
1997; Gerstein, 1997; Gerstein, 1998a; Gerstein, 1998b).
- Genome Data
- Translated genome sequences were taken from the carefully designed
web-site of Chervitz et al. (1998). The genome data is constantly
changing and is contingent on the current "state of the
art" in gene finding. The data used here reflects a particular
snapshot of this ongoing process. In particular, the worm ORF
file used by Chervitz et al. is considerably larger than the
current wormpep file (wormpep 16). The larger Chervitz et al. file was
used for the reported analysis, but all the calculations were also
repeated using only the proteins in wormpep 16 (these are the CE-wp16 results). The
3557 "worm-only" ORFs were taken from a list prepared
by Chervitz et al.
- Structure Data
- Structures were taken from the PDB via the PDB browser (Abola
et al., 1997; Stampf et al., 1995). Domain fold and class definitions
were taken from scop version 1.35, May 1996 (Brenner
et al., 1996; Murzin et al., 1995; Hubbard et al., 1997). Core
structures for each domain were based on refinement of structural
alignments (Altman & Gerstein, 1994; Gerstein & Altman,
1995; Gerstein & Levitt, 1996, 1997).
- The structures in the PDB were clustered into 1135 representative
domains. The clustering was similar in spirit to the many previous
divisions of the PDB into representative chains (e.g. Hobohm
& Sander, 1992, 1994; Brenner, 1997, 1998; Boberg et al.,
1992). However, a slightly different multiple-linkage algorithm
was used (Kaufman & Rousseeuw, 1990). It was designed to
be internally consistent with the search method used to identify
homologues in the genomes, using the same similarity criteria
(a FASTA e-value threshold).
- All information was put into a database analysis system,
with tables cross-referencing sequence identifiers, structure
matches, and so forth and cross-tabulation reports giving the
occurrence of various patterns. The database has been described
previously (Gerstein & Levitt, 1997; Gerstein, 1997; Gerstein,
1998a; Gerstein, 1998b). It is implemented using DBM, Perl5 (Wall
et al., 1996) and the Informix database system. Most of the tables
in the database are accessible from this website in a simple
tab-delimited form. The tables are structured in such a way
that all the genome features (e.g. location of a TM-helix or
PDB match) are annotated in a consistent fashion.
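The cross-tabulation reports described above can be illustrated with a minimal sketch. It assumes a tab-delimited table with one row per structure match. The column names (orf_id, genome, fold_id) and all identifiers in the sample are hypothetical, chosen only for illustration, and do not reflect the actual schema of the tables on this site.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited table: one row per structure match.
# Column names and identifiers are invented for illustration only.
SAMPLE = (
    "orf_id\tgenome\tfold_id\n"
    "sc_orf1\tSC\tfoldA\n"
    "sc_orf2\tSC\tfoldB\n"
    "ce_orf1\tCE\tfoldB\n"
    "ce_orf2\tCE\tfoldB\n"
)


def cross_tabulate(text):
    """Count matches per (fold_id, genome) pair from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[(row["fold_id"], row["genome"])] += 1
    return counts


counts = cross_tabulate(SAMPLE)
for (fold_id, genome), n in sorted(counts.items()):
    print(fold_id, genome, n)
```

A Counter keyed on (fold, genome) pairs is enough for a simple occurrence report. A real system would presumably keep such counts in database tables rather than in memory.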
- Sequence Comparison
- All sequence comparison was done with the FASTA program (version
2.0) (Lipman & Pearson, 1985; Pearson & Lipman, 1988)
with k-tup 1 and an "e-value" threshold of 0.01. This
provides a similar level of sensitivity to the WU-Blast program
used by Chervitz et al. (1998).
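The text above applies one e-value cutoff both to the genome searches and to the clustering of PDB domains. The following is a minimal sketch of that idea, not the procedure actually used: the hit list is invented, treating a hit at exactly the cutoff as significant is an assumption, and a simple single-linkage grouping (union-find) stands in for the multiple-linkage algorithm of Kaufman & Rousseeuw (1990).

```python
E_VALUE_CUTOFF = 0.01  # threshold quoted in the text

# Hypothetical pairwise hits: (query, subject, e_value). Values are invented.
hits = [
    ("domA", "domB", 1e-5),
    ("domB", "domC", 0.005),
    ("domC", "domD", 0.5),    # above the cutoff, so not treated as a match
    ("domE", "domF", 1e-20),
]


def significant(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only pairs whose e-value is at or below the cutoff (assumed inclusive)."""
    return [(q, s) for q, s, e in hits if e <= cutoff]


def single_linkage(pairs, items):
    """Group items connected by any chain of significant pairs (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())


items = sorted({name for q, s, _ in hits for name in (q, s)})
clusters = single_linkage(significant(hits), items)
print(clusters)
```

Using the same `significant` filter for both clustering and homologue detection is what keeps the two steps internally consistent in this sketch.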
Explanation of Results
- Table Related to Fold Usage in Different Genomes
- The table shows the usage of each of the known folds in eight different genomes. The entire table is available over the web. Column "class" is the structural class to which the fold belongs, as determined by scop (Murzin et al., 1995). Column "obj_id" is the fold number in scop 1.35. Columns with two-letter species names (e.g. "EC", "SC") give the total number of matches in one genome for a particular fold. CE stands for C. elegans and CO for the worm-only part of CE. The remaining genome names are defined on the GeneCensus browser page. Column "N_minsp" gives the number of sequence families with a particular fold in the PDB. This column is used to determine whether a fold is a superfold (one of the top 25 in terms of the number of sequence families); the superfolds are highlighted by inverted boxes. Column "N_scop" gives the number of times a particular fold occurs in the PDB, i.e. how many structures have been solved with this fold. Column "bestrep" gives a representative structure with this fold, including residue selection. (In the table, "dom" is used as an abbreviation for domain, "Nt-dom" for N-terminal domain, and "Ct-dom" for C-terminal domain.)
- Venn Diagram
- The Venn diagram shows the number of folds in each genome and how many of these folds are shared between different genomes.
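The superfold ranking and the shared-fold counts behind a Venn diagram can both be derived from a fold-usage table. The sketch below assumes rows keyed by the column names described above ("obj_id", "N_minsp", and two-letter genome codes). The fold identifiers and all counts are invented for illustration and are not results from the analysis.

```python
# Hypothetical fold-usage rows. Column names follow the table description;
# the fold identifiers and counts are invented.
rows = [
    {"obj_id": "f1", "N_minsp": 12, "SC": 30, "CE": 55, "EC": 10},
    {"obj_id": "f2", "N_minsp": 3, "SC": 0, "CE": 8, "EC": 0},
    {"obj_id": "f3", "N_minsp": 7, "SC": 4, "CE": 0, "EC": 6},
    {"obj_id": "f4", "N_minsp": 1, "SC": 2, "CE": 1, "EC": 0},
]


def superfolds(rows, top_n=25):
    """Return the top_n fold ids ranked by number of sequence families."""
    ranked = sorted(rows, key=lambda r: r["N_minsp"], reverse=True)
    return [r["obj_id"] for r in ranked[:top_n]]


def folds_present(rows, genome):
    """Return the set of fold ids with at least one match in the genome."""
    return {r["obj_id"] for r in rows if r[genome] > 0}


sc = folds_present(rows, "SC")
ce = folds_present(rows, "CE")

shared = sc & ce    # folds found in both genomes (Venn overlap)
sc_only = sc - ce   # folds found only in SC
ce_only = ce - sc   # folds found only in CE

print(len(shared), len(sc_only), len(ce_only))
print(superfolds(rows, top_n=2))
```

Set intersection and difference map directly onto the regions of a two-genome Venn diagram. Adding a third genome only requires further set operations of the same kind.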
References
Abola, E. E., Sussman, J. L., Prilusky, J. & Manning, N. O. (1997). Protein Data Bank archives of three-dimensional macromolecular structures. Meth. Enz. 277,
Altman, R. & Gerstein, M. (1994). Finding an Average Core Structure: Application to the Globins. In Proceedings of the Second International Conference on Intelligent Systems in Molecular Biology, pp. 19-27. AAAI Press, Menlo Park, CA.
Boberg, J., Salakoski, T. & Vihinen, M. (1992). Selection of a representative set of structures from Brookhaven Protein Data Bank. Proteins 14, 265-76.
Brenner, S., Chothia, C. & Hubbard, T. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95, 6073-6078.
Brenner, S., Chothia, C., Hubbard, T. J. P. & Murzin, A. G. (1996). Understanding Protein Structure: Using Scop for Fold Interpretation. Meth. Enz. 266, 635-642.
Chervitz, S. A., Aravind, L., Sherlock, G., Ball, C. A., Koonin, E. V., Dwight, S. S., Harris, M. A., Dolinski, K., Mohr, S., Smith, T., Weng, S., Cherry, J. M. & Botstein, D. (1998). Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282, 2022-8.
The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-8.
Gerstein, M. (1997). A Structural Census of Genomes: Comparing Eukaryotic, Bacterial and Archaeal Genomes in terms of Protein Structure. J. Mol. Biol. 274, 562-576.
Gerstein, M. (1998a). How Representative are the Known Structures of the Proteins in a Complete Genome? A Comprehensive Structural Census. Folding & Design 3, 497-512.
Gerstein, M. (1998b). Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census. Proteins 33, 518-534.
Gerstein, M. & Altman, R. (1995). Average core structures and variability measures for protein families: Application to the immunoglobulins. J. Mol. Biol. 251, 161-175.
Gerstein, M. & Levitt, M. (1996). Using Iterative Dynamic Programming to Obtain Accurate Pairwise and Multiple Alignments of Protein Structures. In Proc. Fourth Int. Conf. on Intell. Sys. Mol. Biol., pp. 59-67. AAAI Press, Menlo Park, CA.
Gerstein, M. & Levitt, M. (1997). A Structural Census of the Current Population of Protein Sequences. Proc. Natl. Acad. Sci. USA 94, 11911-11916.
Gerstein, M. & Levitt, M. (1998). Comprehensive Assessment of Automatic Structural Alignment against a Manual Standard, the Scop Classification of Proteins. Protein Science 7, 445-456.
Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Prot. Sci. 1, 409-417.
Hubbard, T. J. P., Murzin, A. G., Brenner, S. E. & Chothia, C. (1997). SCOP: a structural classification of proteins database. Nucleic Acids Res 25, 236-9.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York.
Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: A Structural Classification of Proteins for the Investigation of Sequences and Structures. J. Mol. Biol. 247, 536-540.
Pearson, W. R. & Lipman, D. J. (1988). Improved Tools for Biological Sequence Analysis. Proc. Natl. Acad. Sci. USA 85, 2444-2448.
Stampf, D. R., Felder, C. E. & Sussman, J. L. (1995). PDBbrowse--a graphics interface to the Brookhaven Protein Data Bank. Nature 374, 572-4.
Wall, L., Christiansen, D. & Schwartz, R. (1996). Programming Perl. O'Reilly and Associates, Sebastopol, CA.