GeneCensus Help and Information Page

Linkage Conventions

The linking conventions for GeneCensus are summarized at:

http://bioinfo.mbb.yale.edu/genome/linkhelp.txt

How Sequence Masking Works

This figure illustrates how "sequence masking" works. Various regions of the genomes are annotated with different structural features, such as transmembrane helices or homology to known structure. Sometimes these features overlap, as is often the case for TM-helices and low-complexity regions. After "masking" the first four structural features (PDB matches, low-complexity regions, TM-helices, and linkers), one is left with uncharacterized regions, which can be characterized by a limited amount of structure prediction.

Here is a description of the specific files used in the masking.

Abbrev.	Num	Full Mask Name	Sequence file IN	Sequence file OUT (MBY = masked by)
pdb	1	minscop soluble matches	seq	seq MBY pdb
pdn	1	minscop soluble matches no overlap (best choice for PDB matches)	seq	seq MBY pdb
pdo	1	minscop soluble matches overlap	seq	seq MBY pdo
lcl	2	low complexity long	seq	seq MBY lcl
lcl	2	low complexity long	seq MBY pdb	seq MBY pdb MBY lcl
tms	3	tm segs	seq	seq MBY tms
tms	3	tm segs (best choice for TM segments)	seq MBY pdb MBY lcl	seq MBY pdb MBY lcl MBY tms
tmf	3	tm segs filtered	seq MBY pdb MBY lcl	seq MBY pdb MBY lcl MBY tmf
lnk	4	linkers	seq MBY pdb MBY lcl MBY tms	seq MBY pdb MBY lcl MBY tms MBY lnk
lnk	4	linkers	seq	seq MBY lnk
ucd	5	unchar domains	seq	seq MBY ucd
cdo	6	characterized domains	seq	seq MBY cdo
alp	7	alla segs	seq MBY pdb MBY lcl MBY tms MBY lnk	seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp
bet	8	allb segs	seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp	seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp MBY bet
nul	9	full len segs	seq	seq MBY nul
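
The successive masking listed in the table can be sketched roughly as follows. This is a minimal illustration and not the GeneCensus implementation: the function names, the mask character "x", the toy sequence, and the region coordinates are all assumptions made for the example.

```python
def apply_mask(seq, regions, mask_char="x"):
    """Return seq with every (start, end) region replaced by mask_char.

    Regions are 0-based, half-open intervals and may overlap.
    """
    residues = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


def unmasked_segments(seq, mask_char="x"):
    """Return (start, end) intervals of residues that are still unmasked."""
    segments, start = [], None
    for i, ch in enumerate(seq):
        if ch != mask_char and start is None:
            start = i
        elif ch == mask_char and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(seq)))
    return segments


# Hypothetical annotations for a 40-residue toy sequence.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
masks = [
    ("pdb", [(0, 10)]),
    ("lcl", [(8, 14)]),   # overlaps the pdb match
    ("tms", [(20, 30)]),
]

masked = seq
for name, regions in masks:   # seq MBY pdb MBY lcl MBY tms
    masked = apply_mask(masked, regions)

remaining = unmasked_segments(masked)
```

Because each mask is applied to the output of the previous one, overlapping features (here the pdb and lcl regions) are only counted once, and `remaining` holds the stretches left for later steps such as linker assignment or structure prediction.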

General Methods Used

Translated genome sequences were taken from the web sites (listed in the table). The genome data is constantly changing and is contingent on the current "state of the art" in gene finding. The data used in this paper reflects a particular snapshot of this ongoing process. For instance, the E. coli data file used was version M52, containing 4290 ORFs. This is a more recent version, and it contains a different number of ORFs than the one referred to in the official publication (M49, containing 4288 ORFs, Blattner et al., 1997). Structures were taken from the PDB via the PDB browser (Abola et al., 1997; Stampf et al., 1995). Domain fold and class definitions were taken from scop (version 1.35, May 1996) (Brenner et al., 1996; Murzin et al., 1995; Hubbard et al., 1997). Core structures for each domain were based on refinement of structural alignments (Altman & Gerstein, 1994; Gerstein & Altman, 1995; Gerstein & Levitt, 1996, 1997). The biophysical protein list was constructed in a subjective fashion, based on conversations with colleagues and reading the literature.
The browser attempts to provide a simple view onto a simple relational database, with tables cross-referencing sequence identifiers, structure matches, TM-helix positions, and so forth and cross-tabulation reports giving the occurrence of various patterns. The database is implemented in DBM, Perl5 (Wall et al., 1996) and mini-SQL (http://Hughes.com.au). The tables are structured in such a way that all the genome features (e.g. location of a TM-helix or PDB match) are annotated in a consistent fashion, with thresholds and scoring schemes applied consistently over multiple tools. This attempt at consistency is similar to what has been achieved in other genome annotation systems that aim to integrate multiple tools (Gaasterland & Sensen, 1996; Medigue et al., 1995).
All sequence comparison was done with the FASTA program (version 2.0) (Lipman & Pearson, 1985; Pearson & Lipman, 1988) with k-tup 1 and an "e-value" threshold of 0.01.
The structures in the PDB were clustered into 1135 representative domains. The clustering was similar in spirit to the many previous divisions of the PDB into representative chains (e.g. Hobohm & Sander, 1992, 1994; Brenner, 1997, 1998; Boberg et al., 1992). However, a slightly different multiple-linkage algorithm was used (Kaufman & Rousseeuw, 1990). It was designed to be internally consistent with the search method used to identify homologues in the genomes, using the same similarity criteria (a FASTA e-value threshold). The clustering algorithm takes the results of an all-vs-all comparison of the PDB and creates a graph that has one vertex for each sequence and one edge for each similarity score. Each vertex starts out as a cluster of size one. Since sequence similarity scores (i.e. e-values) are not symmetric (the score for A against B can differ from that for B against A), this directed graph is converted to an undirected graph by removing the better-scoring of the two edges between each pair, so that only the worse score is kept. Then, each edge is considered in turn, and the two clusters associated by this edge are merged into a single cluster if every member of the first cluster has a good-scoring edge between it and every member of the second cluster, and vice versa. The edges are considered in order of decreasing similarity. This has the advantage that close relationships are considered before more distant ones, ensuring that distant relationships are not erroneously used to add a member to a cluster when there exists (for that member) a much closer relationship that would lead to an alternate clustering. Furthermore, this algorithm will produce the same result on the same data set every time; i.e. it is not affected by the order in which the data is traversed.
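
The clustering procedure described above can be sketched as follows. This is a reading of the prose, not the original code: the function name `cluster`, the input format (a dictionary of directed e-values), the treatment of a missing reverse comparison as "no match", and the tie-breaking by identifier are assumptions introduced for the example.

```python
from itertools import product


def cluster(ids, evalues, threshold=0.01):
    """Greedy multiple-linkage clustering on pairwise e-values.

    evalues maps directed pairs (query, target) to an e-value.
    """
    # Undirected graph: keep the worse (larger) e-value of each pair.
    # Assumption: a missing direction counts as no match at all.
    undirected = {}
    for (a, b), e in evalues.items():
        if a == b:
            continue
        reverse = evalues.get((b, a), float("inf"))
        undirected[frozenset((a, b))] = max(e, reverse)

    # Only edges that pass the threshold count as "good".
    good = {k: e for k, e in undirected.items() if e <= threshold}

    # Every sequence starts as a cluster of size one.
    cluster_of = {i: frozenset([i]) for i in ids}

    # Most similar pairs first; ties broken by identifier for determinism.
    for key, _ in sorted(good.items(), key=lambda kv: (kv[1], sorted(kv[0]))):
        a, b = sorted(key)
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue
        # Merge only if every cross-cluster pair has a good edge.
        if all(frozenset((x, y)) in good for x, y in product(ca, cb)):
            merged = ca | cb
            for member in merged:
                cluster_of[member] = merged

    return sorted({tuple(sorted(c)) for c in cluster_of.values()})


# Hypothetical all-vs-all results for four sequences.
ids = ["A", "B", "C", "D"]
evalues = {
    ("A", "B"): 1e-30, ("B", "A"): 1e-28,
    ("B", "C"): 1e-10, ("C", "B"): 1e-9,
    ("A", "C"): 0.5,   ("C", "A"): 1e-5,   # worse direction fails the threshold
    ("C", "D"): 1e-4,  ("D", "C"): 1e-3,
}
clusters = cluster(ids, evalues)
```

In this toy input, C is similar to B but not (in both directions) to A, so it is not added to the A-B cluster and instead joins D. Sorting the edges before merging is what makes the result independent of the order in which the input is supplied.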
Cluster trees based on distance matrices were built with the Kitsch program, which is part of the Phylip package (Felsenstein, 1989, 1993).
Transmembrane segments were identified using the GES hydrophobicity scale (Engelman et al., 1986). The values from the scale for amino acids in a window of size 20 (the typical size of a transmembrane helix) were averaged and then compared against a cutoff of -1 kcal/mole. A value under this cutoff was taken to indicate the existence of a transmembrane helix. Initial hydrophobic stretches corresponding to signal sequences for membrane insertion were excluded. (These have the pattern of a charged residue within the first seven, followed by a stretch of 14 with an average hydrophobicity under the cutoff.)
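
The sliding-window test described above might look something like the sketch below. The function names are assumptions, the signal-sequence exclusion is not implemented, and the two-letter scale in the example holds placeholder values chosen for easy arithmetic; they are not the published GES values, which would be supplied as the `scale` argument in practice.

```python
def window_averages(seq, scale, window=20):
    """Average scale value of each full-length window along seq."""
    values = [scale[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def tm_window_starts(seq, scale, window=20, cutoff=-1.0):
    """0-based start positions of windows whose average is under the cutoff."""
    return [
        i for i, avg in enumerate(window_averages(seq, scale, window))
        if avg < cutoff
    ]


# Placeholder scale (NOT the GES values): negative means hydrophobic.
TOY_SCALE = {"L": -2.0, "K": 3.0}

# Ten charged residues, a 20-residue hydrophobic stretch, ten charged residues.
toy_seq = "K" * 10 + "L" * 20 + "K" * 10

starts = tm_window_starts(toy_seq, TOY_SCALE)
```

Windows that overlap the hydrophobic stretch almost completely fall under the cutoff, so `starts` is a short run of consecutive positions centred on the start of that stretch; adjacent hits would normally be merged into a single predicted helix.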
Low-complexity sequences were identified with the SEG program (Wootton & Federhen, 1993, 1996; Wootton, 1994) using the standard parameters K(1) = 3.4 and K(2) = 3.75, and a window of length 45. These parameters are the ones used to find "long" domain-size low-complexity regions. Characterized regions were considered to be PDB matches, TM-helices, or low-complexity regions. Linker regions were considered to be stretches of uncharacterized sequence that connected two characterized regions and were less than 50 residues in length. Linkers also included short sequences at the N or C terminus. Initial Met residues were excluded from the statistics on linker regions.
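
The linker definition above can be illustrated with a short sketch. The function name and interval format are assumptions, "short" terminal sequences are assumed to use the same 50-residue limit as internal linkers, and the exclusion of initial Met residues is left out.

```python
def find_linkers(seq_len, characterized, max_len=50):
    """Split uncharacterized stretches into linkers and longer regions.

    characterized: 0-based, half-open (start, end) intervals, which may
    overlap. Returns (linkers, remaining) as lists of intervals.
    """
    linkers, remaining = [], []
    pos = 0  # end of the characterized sequence seen so far
    for start, end in sorted(characterized):
        if start > pos:
            gap = (pos, start)
            (linkers if start - pos < max_len else remaining).append(gap)
        pos = max(pos, end)
    if pos < seq_len:
        # C-terminal stretch after the last characterized region.
        gap = (pos, seq_len)
        (linkers if seq_len - pos < max_len else remaining).append(gap)
    return linkers, remaining


# Hypothetical 400-residue protein with three characterized regions.
linkers, remaining = find_linkers(400, [(30, 120), (110, 150), (180, 260)])
```

Here the 30-residue N-terminal stretch and the 30-residue gap between regions are classed as linkers, while the 140-residue C-terminal stretch is left as an uncharacterized region.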
Secondary structure prediction was done using the GOR program (Garnier et al., 1996; Garnier et al., 1978; Gibrat et al., 1987). This is a well-established and commonly used method.

Two specific limitations to the fold-usage analysis presented here are discussed below.

Limitations of the Approach: The Small, Incomplete Number of Known Folds

First, only a relatively small number of folds can be surveyed, involving no more than a fifth of the ORFs in a genome. (This number would be even smaller if one were to restrict attention to just the ORFs in a genome that have been studied directly by crystallography or NMR. For example, it is currently 52 out of 6218 for yeast, as reported by Sacc3D (Cherry et al., 1998).)

The situation is expected to improve in the future as new structures are determined, but it will be a while before all the folds in a genome are known -- especially considering that the increase in new folds is much slower than the increase in new structures (Brenner et al., 1997). An important corollary of this is that the absolute counts found in a given genome survey are (usually) an under-representation of the true numbers. Furthermore, they are contingent on the evolving contents of the databank. Thus, over time as more structures are added to the databank, one should expect such statistics as the most common folds and number of shared folds to change somewhat.

Comprehensive application of ab initio structure prediction and advanced sequence-comparison and fold-recognition methods to complete genomes can partly overcome the limitations of only knowing a small number of folds (Fischer & Eisenberg, 1997; Gerstein, 1997), allowing one to survey the complete inventory of proteins in an organism. However, in its present form, structure prediction is not a substitute for structure determination, especially in situations where the fold is completely new. Moreover (as discussed below), using state-of-the-art sequence comparison methods introduces a measure of variability and uncertainty into the results, as different methods will give different results at the margin.

In summary, different comparison programs and fold databases will give different numbers.

Limitations of the Approach: Biases in PDB and in the Genomes

In addition to rendering the results here, in a sense, incomplete, the small number of known folds also means that the results may be influenced to some degree by the biases in the PDB. These biases are manifest in a number of ways.

First and most simply, there is a considerable disparity between how often a fold occurs in the genomes (i.e. how many total matches it has) and how often it occurs in the PDB (i.e. how many known structures have this fold). This is indicated in Figure 1 (and in the web presentation). One can immediately see how different the common folds are in the PDB versus in the genomes. This illustrates in a direct sense the biases in the PDB -- although these sorts of biases are not expected to affect the results (which are principally concerned with "membership" rather than absolute counts).

Second and more subtly, the composition of the PDB is biased towards folds that occur in more heavily studied organisms such as E. coli (EC) and S. cerevisiae (SC). These biases are probably reflected in some of the results -- specifically, in the finding that there are many more known folds and unique folds in the bacterium H. influenzae (HI) than in the archaeon M. jannaschii (MJ), even though both of these organisms have genomes of approximately the same size.

Another subtle bias in the results here is in the selection of genomes. The eight organisms picked were the first to have their complete genomes sequenced; by necessity, the same has been true of all the other multi-genome comparisons to date (e.g. Tatusov et al., 1997). A more balanced comparison would perhaps include more comparable numbers of eukaryotes, archaea, and bacteria.

References
