GeneCensus Help and Information Page

Linkage Conventions

The linking conventions for GeneCensus are summarized at

http://bioinfo.mbb.yale.edu/genome/linkhelp.txt

How Sequence Masking Works

This figure illustrates how "sequence masking" works. Various regions of the genomes are annotated with different structural features, such as transmembrane (TM) helices or homology to a known structure. Sometimes these features overlap, as is often the case for TM-helices and low-complexity regions. After "masking" the first four structural features (PDB matches, low-complexity regions, TM-helices, and linkers), one is left with uncharacterized regions, which can then be characterized by a limited amount of structure prediction.
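The masking idea described above can be sketched in a few lines of Python. This is only an illustration, not the code used by GeneCensus: the function names, the use of "X" as the mask character, and the 0-based half-open coordinates are all assumptions made for the example.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # assumed convention: 0-based, half-open [start, end)

MASK_CHAR = "X"  # assumed placeholder; a native "X" residue would look masked


def apply_mask(seq: str, regions: List[Region]) -> str:
    """Replace every residue inside the given regions with MASK_CHAR."""
    residues = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = MASK_CHAR
    return "".join(residues)


def mask_in_order(seq: str, masks: List[Tuple[str, List[Region]]]) -> Tuple[str, str]:
    """Apply masks one after another; return (masked sequence, label).

    The label follows the "seq MBY pdb MBY lcl" naming, where MBY = masked by.
    Overlapping features are harmless: a residue is simply masked once.
    """
    label = "seq"
    for name, regions in masks:
        seq = apply_mask(seq, regions)
        label += f" MBY {name}"
    return seq, label


def unmasked_regions(seq: str) -> List[Region]:
    """Return the stretches that survived all masks (uncharacterized regions)."""
    regions: List[Region] = []
    start = None
    for i, ch in enumerate(seq):
        if ch != MASK_CHAR and start is None:
            start = i
        elif ch == MASK_CHAR and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(seq)))
    return regions


# Made-up example: a 30-residue sequence with invented feature coordinates.
example_seq = "A" * 30
example_masks = [("pdb", [(0, 10)]), ("lcl", [(8, 14)]), ("tms", [(20, 25)])]
masked_seq, masked_label = mask_in_order(example_seq, example_masks)
leftover = unmasked_regions(masked_seq)
```

In this sketch the order of the masks only affects the label, not the final masked sequence; a real pipeline might instead assign each residue to the first feature that claims it.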

Here is a description of the specific files used in the masking.

Abbrev.  Num  Full Mask Name                                                    Sequence file IN                             Sequence file OUT (MBY = masked by)

pdb      1    minscop soluble matches                                           seq                                          seq MBY pdb
pdn      1    minscop soluble matches no overlap (best choice for PDB matches)  seq                                          seq MBY pdb
pdo      1    minscop soluble matches overlap                                   seq                                          seq MBY pdo
lcl      2    low complexity long                                               seq                                          seq MBY lcl
lcl      2    low complexity long                                               seq MBY pdb                                  seq MBY pdb MBY lcl
tms      3    tm segs                                                           seq                                          seq MBY tms
tms      3    tm segs (best choice for TM segments)                             seq MBY pdb MBY lcl                          seq MBY pdb MBY lcl MBY tms
tmf      3    tm segs filtered                                                  seq MBY pdb MBY lcl                          seq MBY pdb MBY lcl MBY tmf
lnk      4    linkers                                                           seq MBY pdb MBY lcl MBY tms                  seq MBY pdb MBY lcl MBY tms MBY lnk
lnk      4    linkers                                                           seq                                          seq MBY lnk
ucd      5    unchar domains                                                    seq                                          seq MBY ucd
cdo      6    characterized domains                                             seq                                          seq MBY cdo
alp      7    alla segs                                                         seq MBY pdb MBY lcl MBY tms MBY lnk          seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp
bet      8    allb segs                                                         seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp  seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp MBY bet
nul      9    full len segs                                                     seq                                          seq MBY nul

General Methods Used

Top-10 Folds

The figure shows pictures of the ten most common folds that are shared amongst all eight genomes. The figure is arranged from TOP-ROW to BOTTOM-ROW: 3 barrel folds, 3 classic alpha/beta folds with helices packed on either side of a central sheet, 3 folds with helices packed onto a single face of a sheet, and 1 fold with a more complex structure (class II synthetase). All folds are drawn with molscript (Kraulis, 1991). They are somewhat simplified so that coil geometry is smoothed out and insertions not packing against the central sheet are de-emphasized. Folds that are superfolds are indicated by a black circle ("●") in the lower right-hand corner.

Explanation of Tables Related to Fold Usage in Different Genomes

The table shows the usage of each of the known folds in eight different genomes. The entire table is available over the web at http://bioinfo.mbb.yale.edu/census/browser/fold-report. Column 1 ("class") is the structural class that the fold belongs to, as determined by scop (Murzin et al., 1995). Column 2 ("Fold#") is the fold number in scop 1.35. Columns 3 to 10 ("EC" to "MG") each give the total number of matches in one genome for a particular fold. For instance, the first row shows that there are 19 Rossmann fold domains in the HP genome. The genome columns are sorted by the total number of matches in each genome, with EC having the most and MG the least. Column 11 ("Total") is the row total of columns 3 to 10, i.e. the total number of times the fold occurs in all eight genomes. Column 12 ("Fam.") gives the number of sequence families with a particular fold in the PDB. This column is used to determine whether or not a fold is a superfold (top-25 in terms of the number of sequence families), and the superfolds are highlighted by inverted boxes. Column 13 ("PDB") gives the number of times a particular fold occurs in the PDB, i.e. how many structures have been solved with this fold. This column should be compared with column 11 ("Total") to highlight the biases in the PDB. Column 14 ("Rep. Struc.") gives a representative structure with this fold, including residue selection. (The residue selection for GroEL is A:2-136, A:410-525.) (In the table, "dom" is used as an abbreviation for domain, "Nt-dom" for N-terminal domain, and "Ct-dom" for C-terminal domain.)
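The derived columns described above (the row total, the ordering of the genome columns, and the superfold flag) could be computed roughly as follows. This is a minimal sketch under assumptions: the FoldRow class and its field names are hypothetical, the numbers in the example are invented, and ties at the top-25 cutoff are broken arbitrarily by input order.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class FoldRow:
    """One table row (hypothetical layout, not the GeneCensus data format)."""
    fold_id: str
    genome_counts: Dict[str, int]  # columns 3 to 10: matches per genome
    families: int                  # column 12 ("Fam.")
    pdb_count: int                 # column 13 ("PDB")

    @property
    def total(self) -> int:
        """Column 11 ("Total"): sum of the per-genome match counts."""
        return sum(self.genome_counts.values())


def superfolds(rows: List[FoldRow], top_n: int = 25) -> Set[str]:
    """Folds ranked in the top_n by number of sequence families."""
    ranked = sorted(rows, key=lambda r: r.families, reverse=True)
    return {r.fold_id for r in ranked[:top_n]}


def genome_order(rows: List[FoldRow]) -> List[str]:
    """Genome codes sorted by total matches in that genome, most first."""
    totals: Dict[str, int] = {}
    for row in rows:
        for genome, n in row.genome_counts.items():
            totals[genome] = totals.get(genome, 0) + n
    return sorted(totals, key=lambda g: totals[g], reverse=True)


# Invented rows, two genomes only, purely to exercise the functions.
demo_rows = [
    FoldRow("f1", {"EC": 5, "MG": 1}, families=10, pdb_count=40),
    FoldRow("f2", {"EC": 2, "MG": 0}, families=3, pdb_count=90),
    FoldRow("f3", {"EC": 1, "MG": 1}, families=7, pdb_count=5),
]
```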

Disclaimer

All the matches reported by GeneCensus, particularly via the "find folds in genomes" query box, are based on simple automatic application of a sequence comparison program to the genomes -- in particular FASTA with an e-value threshold of 0.01. As such, they are expected to under-report the number of true-positive matches compared to more elaborate approaches (though perhaps they also have few false positives!).
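As a rough illustration of what an e-value cutoff does to the reported matches, the sketch below filters a list of hits at 0.01 and counts the surviving matches per fold. The Hit record and its field names are hypothetical, parsing of actual FASTA output is not shown, and treating the threshold as inclusive is an assumption.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """A single sequence match (hypothetical record, not a FASTA format)."""
    orf_id: str
    fold_id: str
    e_value: float


E_VALUE_THRESHOLD = 0.01  # the cutoff quoted in the disclaimer


def accepted_hits(hits: Iterable[Hit], threshold: float = E_VALUE_THRESHOLD) -> List[Hit]:
    """Keep hits at or below the threshold (inclusive by assumption)."""
    return [h for h in hits if h.e_value <= threshold]


def matches_per_fold(hits: Iterable[Hit]) -> Counter:
    """Count accepted matches for each fold."""
    return Counter(h.fold_id for h in hits)


# Invented hits: one strong, one weak, one exactly at the cutoff.
demo_hits = [
    Hit("orf1", "foldA", 1e-5),
    Hit("orf2", "foldA", 0.5),
    Hit("orf3", "foldB", 0.01),
]
demo_counts = matches_per_fold(accepted_hits(demo_hits))
```

A stricter threshold removes marginal hits such as the second one here, which is why a simple cutoff tends to under-report true matches while keeping false positives low.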

Two specific limitations to the fold-usage analysis presented here are discussed below.

Limitations of the Approach: The Small, Incomplete Number of Known Folds

First, only a relatively small number of folds can be surveyed, involving no more than a fifth of the ORFs in a genome. (This number would be even smaller if one were to restrict attention to just the ORFs in a genome that have been studied directly by crystallography or NMR. For example, it is currently 52 out of 6218 for yeast, as reported by Sacc3D (Cherry et al., 1998).)

The situation is expected to improve in the future as new structures are determined, but it will be a while before all the folds in a genome are known -- especially considering that the increase in new folds is much slower than the increase in new structures (Brenner et al., 1997). An important corollary of this is that the absolute counts found in a given genome survey are (usually) an under-representation of the true numbers. Furthermore, they are contingent on the evolving contents of the databank. Thus, over time as more structures are added to the databank, one should expect such statistics as the most common folds and number of shared folds to change somewhat.

Comprehensive application of ab initio structure prediction and advanced sequence-comparison and fold recognition methods to complete genomes can partly overcome the limitations of only knowing a small number of folds (Fischer & Eisenberg, 1997; Gerstein, 1997), allowing one to survey the complete inventory of proteins in an organism. However, in its present form, structure prediction is not a substitute for structure determination, especially in situations where the fold is completely new. Moreover (as discussed below), using state-of-the-art sequence comparison methods introduces a measure of variability and uncertainty into the results, as different methods will give different results at the margin.

In summary, different comparison programs and fold databases will give different numbers.

Limitations of the Approach: Biases in PDB and in the Genomes

In addition to rendering the results here, in a sense, incomplete, the small number of known folds also means that the results may be influenced to some degree by the biases in the PDB. These biases are manifest in a number of ways.

First and most simply, there is a considerable disparity between how often a fold occurs in the genomes (i.e. how many total matches it has) and how often it occurs in the PDB (i.e. how many known structures have this fold). This is indicated in Figure 1 (and in the web presentation). One can immediately see how different the common folds are in the PDB versus in the genomes. This illustrates in a direct sense the biases in the PDB -- although this sort of bias is not expected to affect the results (which are principally concerned with "membership" rather than absolute counts).
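One simple way to put a number on this disparity is to compare each fold's share of all genome matches with its share of all PDB structures. The sketch below is an assumption about how such a comparison might be done, not the method behind the figure; the function names and the counts in the example are invented.

```python
from typing import Dict


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw counts per fold into fractions of the overall total."""
    total = sum(counts.values())
    if total == 0:
        return {fold: 0.0 for fold in counts}
    return {fold: n / total for fold, n in counts.items()}


def pdb_over_genome_ratio(genome_counts: Dict[str, int],
                          pdb_counts: Dict[str, int]) -> Dict[str, float]:
    """Ratio of PDB share to genome share for folds present in both.

    A ratio above 1 means the fold is over-represented in the PDB
    relative to the genomes; below 1 means under-represented.
    """
    g = shares(genome_counts)
    p = shares(pdb_counts)
    return {fold: p[fold] / g[fold] for fold in g if fold in p and g[fold] > 0}


# Invented counts for two folds, chosen only to make the arithmetic easy.
demo_ratio = pdb_over_genome_ratio({"foldA": 30, "foldB": 10},
                                   {"foldA": 10, "foldB": 30})
```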

Second and more subtly, the composition of the PDB is biased towards folds that occur in more heavily studied organisms such as EC and SC. These biases are probably reflected in some of the results -- specifically, in the finding that there are many more known folds and unique folds in the bacterium HI than in the archaeon MJ, even though both of these organisms have genomes of approximately the same size.

Another subtle bias in the results here is in the selection of genomes. The eight organisms picked were the first to have their complete genomes sequenced, a choice made by necessity in all the other multi-genome comparisons to date (e.g. Tatusov et al., 1997). A more balanced comparison would perhaps include more comparable numbers of eukaryotes, archaea, and bacteria.

References
