http://bioinfo.mbb.yale.edu/genome/linkhelp.txt
Here is a description of the specific files used in the masking.

How Sequence Masking Works
This figure illustrates how "sequence masking" works. Various regions
of the genomes are annotated with different structural features, such
as transmembrane helices or homology to known structure. Sometimes
these features overlap, as is often the case for TM-helices and
low-complexity regions. After "masking" the first four structural
features (PDB matches, low-complexity regions, TM-helices, and
linkers), one is left with uncharacterized regions, which can be
characterized by a limited amount of structure prediction.
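
The masking order described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the actual mask files. The toy sequence, the region coordinates, the 0-based half-open interval convention, the "X" mask character, and the function names apply_mask, mask_in_order, and unmasked_regions are all assumptions made for the example.

```python
def apply_mask(seq, regions, mask_char="X"):
    """Replace every residue inside the given (start, end) regions with mask_char.

    Regions are 0-based, half-open intervals (an assumed convention).
    """
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def mask_in_order(seq, annotations, order):
    """Apply each named mask in turn; overlapping features simply mask twice."""
    masked = seq
    for name in order:
        masked = apply_mask(masked, annotations.get(name, []))
    return masked


def unmasked_regions(masked, mask_char="X"):
    """Return the (start, end) intervals that no mask has covered."""
    regions, start = [], None
    for i, c in enumerate(masked):
        if c != mask_char and start is None:
            start = i
        elif c == mask_char and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(masked)))
    return regions


# Toy 40-residue sequence and invented feature coordinates (illustration only).
toy_seq = "ACDEFGHIKLMNPQRSTVWY" * 2
toy_annotations = {
    "pdb": [(0, 10)],   # PDB match
    "lcl": [(18, 24)],  # low-complexity region
    "tms": [(20, 30)],  # TM-helix, overlapping the low-complexity region
    "lnk": [(30, 33)],  # linker
}
toy_masked = mask_in_order(toy_seq, toy_annotations, ["pdb", "lcl", "tms", "lnk"])
uncharacterized = unmasked_regions(toy_masked)
```

In this toy case the regions left in uncharacterized are the stretches that would go on to structure prediction. Because masking only overwrites residues, the overlap between the TM-helix and the low-complexity region needs no special handling.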
Abbrev. | Num | Full Mask Name | Sequence file IN | Sequence file OUT (MBY = masked by) |
pdb | 1 | minscop soluble matches | seq | seq MBY pdb |
pdn | 1 | minscop soluble matches no overlap (best choice for PDB matches) | seq | seq MBY pdb |
pdo | 1 | minscop soluble matches overlap | seq | seq MBY pdo |
lcl | 2 | low complexity long | seq | seq MBY lcl |
lcl | 2 | low complexity long | seq MBY pdb | seq MBY pdb MBY lcl |
tms | 3 | tm segs | seq | seq MBY tms |
tms | 3 | tm segs (best choice for TM segments) | seq MBY pdb MBY lcl | seq MBY pdb MBY lcl MBY tms |
tmf | 3 | tm segs filtered | seq MBY pdb MBY lcl | seq MBY pdb MBY lcl MBY tmf |
lnk | 4 | linkers | seq MBY pdb MBY lcl MBY tms | seq MBY pdb MBY lcl MBY tms MBY lnk |
lnk | 4 | linkers | seq | seq MBY lnk |
ucd | 5 | unchar domains | seq | seq MBY ucd |
cdo | 6 | characterized domains | seq | seq MBY cdo |
alp | 7 | alla segs | seq MBY pdb MBY lcl MBY tms MBY lnk | seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp |
bet | 8 | allb segs | seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp | seq MBY pdb MBY lcl MBY tms MBY lnk MBY alp MBY bet |
nul | 9 | full len segs | seq | seq MBY nul |
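
The "MBY" naming scheme in the table lends itself to simple parsing: each output name is the input name with one more mask appended. The sketch below shows one way to read such names. The function names parse_mask_chain and output_name are hypothetical, and the assumption that names are whitespace-separated tokens alternating with the literal word MBY is inferred from the table rather than documented.

```python
def parse_mask_chain(name):
    """Split a name such as 'seq MBY pdb MBY lcl' into (base, [masks in order])."""
    tokens = name.split()
    if not tokens:
        raise ValueError("empty name")
    base, rest = tokens[0], tokens[1:]
    # After the base, tokens must alternate: MBY, abbrev, MBY, abbrev, ...
    if len(rest) % 2 != 0 or any(t != "MBY" for t in rest[0::2]):
        raise ValueError(f"not a valid MBY chain: {name!r}")
    return base, rest[1::2]


def output_name(name_in, abbrev):
    """Build the OUT name produced by applying one more mask to the IN file."""
    return f"{name_in} MBY {abbrev}"


base, masks = parse_mask_chain("seq MBY pdb MBY lcl MBY tms")
next_name = output_name("seq MBY pdb MBY lcl MBY tms", "lnk")
```

Here masks lists the masks in the order they were applied, so the history of any file can be recovered from its name alone.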
Two specific limitations to the fold-usage analysis presented here are discussed below.
The situation is expected to improve in the future as new structures are determined, but it will be a while before all the folds in a genome are known -- especially considering that the increase in new folds is much slower than the increase in new structures (Brenner et al., 1997). An important corollary of this is that the absolute counts found in a given genome survey are (usually) an under-representation of the true numbers. Furthermore, they are contingent on the evolving contents of the databank. Thus, over time as more structures are added to the databank, one should expect such statistics as the most common folds and number of shared folds to change somewhat.
Comprehensive application of ab initio structure prediction and advanced sequence-comparison and fold-recognition methods to complete genomes can partly overcome the limitation of knowing only a small number of folds (Fischer & Eisenberg, 1997; Gerstein, 1997), allowing one to survey the complete inventory of proteins in an organism. However, in its present form, structure prediction is not a substitute for structure determination, especially in situations where the fold is completely new. Moreover (as discussed below), using state-of-the-art sequence comparison methods introduces a measure of variability and uncertainty into the results, as different methods will give different results at the margin.
In summary, different comparison programs and fold databases will give different numbers.
First and most simply, there is a considerable disparity between how often a fold occurs in the genomes (i.e. how many total matches it has) and how often it occurs in the PDB (i.e. how many known structures have this fold). This is indicated in Figure 1 (and in the web presentation). One can immediately see how different the common folds are in the PDB versus in the genomes. This directly illustrates the biases in the PDB -- although these sorts of biases are not expected to affect the results (which are principally concerned with "membership" rather than absolute counts).
Second and more subtly, the composition of the PDB is biased towards folds that occur in more heavily studied organisms such as EC (Escherichia coli) and SC (Saccharomyces cerevisiae). These biases are probably reflected in some of the results -- specifically, in the finding that there are many more known folds and unique folds in the bacterium HI (Haemophilus influenzae) than in the archaeon MJ (Methanococcus jannaschii), even though both of these organisms have genomes of approximately the same size.
Another subtle bias in the results here is in the selection of genomes. The eight organisms picked were simply the first to have their complete genomes sequenced, a constraint shared by necessity with all the other multi-genome comparisons to date (e.g. Tatusov et al., 1997). A more balanced comparison would perhaps include more comparable numbers of eukaryotes, archaea, and bacteria.
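
The distinction drawn above between fold "membership" and absolute counts can be made concrete with a small sketch. Everything in it is invented for illustration: the fold names are placeholders, the match counts are not taken from any genome survey, and the function names folds_present, shared_folds, and unique_folds are hypothetical.

```python
from collections import Counter

# Placeholder fold names with invented match counts (not real survey data).
genome_a = Counter({"fold_1": 40, "fold_2": 5, "fold_3": 1})
genome_b = Counter({"fold_2": 30, "fold_3": 2, "fold_4": 7})


def folds_present(genome):
    """Membership view: the set of folds with at least one match."""
    return {fold for fold, n in genome.items() if n > 0}


def shared_folds(*genomes):
    """Folds present in every genome given."""
    return set.intersection(*(folds_present(g) for g in genomes))


def unique_folds(target, *others):
    """Folds present in target but in none of the other genomes."""
    result = folds_present(target)
    for g in others:
        result -= folds_present(g)
    return result


# Doubling every count in genome_a mimics a databank bias that inflates counts.
inflated_a = Counter({fold: 2 * n for fold, n in genome_a.items()})

shared_before = shared_folds(genome_a, genome_b)
shared_after = shared_folds(inflated_a, genome_b)
unique_a = unique_folds(genome_a, genome_b)
```

In this toy case the shared and unique sets are identical before and after the counts are inflated, which is the sense in which membership statistics are insensitive to this kind of bias. A fold that is missing from the databank altogether would still change membership, so the sketch does not address the under-representation discussed earlier.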