# Error and Reliability Analysis of PartsList Database Statistics

## Reliability of FASTA Run Statistics

The FASTA genome comparisons were run with a threshold of 10^-4 (FASTA value of 0.01). This translates into roughly one erroneous match for every 100 matches (1, 2).

Let us consider some examples:

For C. elegans, a total of 7,803 matches were found over 247 folds. This works out to just over 31 matches per fold. If, on average, one in 100 matches is in error, then about 78 matches are in error. Since there are about 31 matches per fold, a worst-case analysis suggests that the aggregate error in the fold count could be around 3.

Similarly, for E. coli, 1,600 matches translate to about 16 errors. With 7 matches per fold, this suggests the fold count could be off by about 2.

For Chlamydia, 348 matches yield about 3 errors. With 2.6 matches per fold, under the worst-case assumptions we would expect an aggregate error in the fold count of about 1.
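The arithmetic in these three examples can be reproduced with a short sketch. The function name and the reading of "worst case" (erroneous matches grouped so that each set of `matches_per_fold` errors accounts for one miscounted fold) are assumptions made for illustration, not part of the original analysis.

```python
def worst_case_fold_error(total_matches, matches_per_fold, error_rate=0.01):
    """Return (expected erroneous matches, worst-case error in the fold count).

    Assumption for illustration: every `matches_per_fold` erroneous matches
    are taken to account for one miscounted fold.
    """
    erroneous_matches = total_matches * error_rate
    fold_count_error = erroneous_matches / matches_per_fold
    return erroneous_matches, fold_count_error


# Match totals and matches-per-fold values as quoted in the text.
examples = {
    "C. elegans": (7803, 31),
    "E. coli": (1600, 7),
    "Chlamydia": (348, 2.6),
}

for genome, (matches, per_fold) in examples.items():
    errors, fold_error = worst_case_fold_error(matches, per_fold)
    print(f"{genome}: ~{errors:.1f} erroneous matches, "
          f"fold count off by ~{fold_error:.1f}")
```

Under these assumptions the fold-count errors come out at roughly 2.5, 2.3, and 1.3, in line with the rounded figures of about 3, 2, and 1 quoted above.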

## Reliability of Expression Statistics

The worst-case error rates in the expression data used are +100%/-50% (3). These data are then averaged together across folds, where we may assume that the random errors tend to cancel out.

If we approximate the errors as normally distributed (which the Central Limit Theorem gives us some justification for doing), then, essentially, the standard deviation is approximately one-half the magnitude of the data.

Therefore, if we average the data across the N representatives of a fold class, the standard deviation of the average becomes sqrt(N (0.5x)^2)/N = 0.5x/sqrt(N). This follows because the variance of a sum of independent errors is the sum of the variances, so the variance of the average is that sum divided by N^2. The worst-case error therefore scales to +100%/sqrt(N) and -50%/sqrt(N), and these values may be used to produce estimated worst-case error bars for each of the folds. Some folds have one representative, so the error on these folds is approximately +100%/-50%, while others have hundreds of representatives, suggesting a worst-case error of roughly +/-10% or less. This analysis makes worst-case assumptions; in practice, the actual data should be more accurate than these results suggest.
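As a minimal sketch of how such error bars could be computed, the function below scales the +100%/-50% single-measurement bounds by 1/sqrt(N). The function name and its parameters are hypothetical, and the sketch assumes independent errors, as in the argument above.

```python
import math


def worst_case_error_bars(n_representatives, upper=1.0, lower=0.5):
    """Fractional (upper, lower) error bars after averaging over N representatives.

    `upper` and `lower` are the single-measurement worst-case bounds
    (+100% and -50%); both shrink by 1/sqrt(N) for independent errors.
    """
    if n_representatives < 1:
        raise ValueError("a fold class needs at least one representative")
    scale = 1.0 / math.sqrt(n_representatives)
    return upper * scale, lower * scale


for n in (1, 4, 25, 100):
    up, down = worst_case_error_bars(n)
    print(f"N={n:>3}: +{up:.0%} / -{down:.0%}")
```

With a single representative the bounds stay at +100%/-50%; with 100 representatives they shrink to about +10%/-5%.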

## Reliability of Alignment Statistics

The errors in the alignments depend largely on the errors in the RMS values computed for the pairs (4). These RMS calculations were done by the wgkalign program, which uses a modified-sieve fit algorithm that iteratively eliminates the worst-fitting half of the atoms from consideration. If anything, this procedure should improve the overall error resistance of the process compared with a traditional RMS fit. We would therefore expect the error to derive purely from errors in the positions of the atoms in the PDB, for which a reasonable value might be 0.1 to 0.2 Angstroms. An RMS calculation, whether traditional or as computed by wgkalign, can be expected to average out random but not systematic error, so an error of 0.1 to 0.2 Angstroms in the final RMS value represents a truly worst-case scenario of systematic error in the underlying PDB file.

This error rate is far lower than that which can be expected to be introduced through SCOP classification error alone, and so we would expect SCOP classification error to be the dominant source of error in these data.

## Reliability of Motions Statistics

The errors in the motions data submitted to PartsList are entirely derived from the output of the wgkalign program (5). Consequently, the errors will be very similar to those discussed above under Reliability of Alignment Statistics.

For the hinge motion data, also derived from wgkalign output, the same analysis may be applied by approximating the sin function as the identity function (the first term in the Taylor expansion). Rotations may range from 0 to 180 degrees, so an error of 0.1 to 0.2 Angstroms translates into an error of less than 10% in terms of degrees.
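The small-angle step can be illustrated with a short sketch. It assumes, purely for illustration, that a positional error acts at some lever-arm distance from the hinge axis; the 10 Angstrom lever arm below is a hypothetical value, not a figure from the original analysis.

```python
import math


def rotation_error_degrees(position_error, lever_arm):
    """Return (exact, small-angle) estimates of the angular error in degrees.

    `position_error` and `lever_arm` are in the same units (e.g. Angstroms).
    The lever arm is a hypothetical distance from the hinge axis.
    """
    ratio = position_error / lever_arm
    exact = math.degrees(math.asin(ratio))  # full inverse-sine expression
    approx = math.degrees(ratio)            # sin(theta) ~ theta
    return exact, approx


for delta in (0.1, 0.2):
    exact, approx = rotation_error_degrees(delta, lever_arm=10.0)
    print(f"error {delta} A: exact {exact:.4f} deg, small-angle {approx:.4f} deg")
```

For errors this small relative to the lever arm, the two estimates agree very closely, which is why treating sin as the identity is reasonable here.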

## Reliability of Amino Acid Composition Statistics

The error in this case would reflect sequencing and data entry errors in the sequences of proteins with solved structures. Given the difficulty of solving the structure of a protein with an incorrect sequence, we would expect such errors to be very rare. Amino acid composition should be the most reliable data in the PartsList database, with the error dominated by classification errors in the SCOP database.

## Reliability of Domain Interaction Statistics

The domain interactions are determined from the PDB and yeast genome data, and are reported in the Interaction Report for individual PDB or SCOP entries and in the Fold/Superfamily Reports. In the Interaction Reports for individual PDB entries, the data are derived from analysis of the atomic coordinates in the PDB entry and the SCOP domain information. The only false positives in the PDB-derived data will therefore be interactions between chains that arise from crystal contacts rather than real interactions. Irene Nooren (personal communication) estimates that such crystal contacts produce false-positive chain interactions in 6.3% of all PDB entries. The rate of false positives at the level of superfamilies/folds will be lower than this, because some crystal contacts do not change the superfamily contact, and many false positives cluster in the same superfamily/fold (e.g. crystal contacts in lysozyme, for which there are hundreds of PDB entries).

For the Fold and Superfamily Reports, for which one can also choose yeast experimental data (note: only a fraction of this is derived by the yeast two-hybrid method), we also have to consider errors in the genetic, physical, and biochemical experiments used to derive interactions between yeast proteins. The fold/superfamily interactions are derived from structural assignments to the yeast sequences: intramolecular interactions from adjacent domains, and intermolecular interactions from single-domain proteins known to interact from the experimental data. For the intramolecular interactions, errors arise from domains that are adjacent (within 30 amino acids) but do not interact. These are a few special cases, such as DNA-binding domains where DNA intercalates between adjacent domains, or cases where another polypeptide chain slips between two domains, as in Phe-tRNA synthetase. The pairs of superfamilies where this was observed in the PDB were automatically eliminated from the yeast assignments analysis.

For the yeast intermolecular interactions, there are 55 superfamily-level interactions. Of these, 28 are also observed in the PDB and are therefore highly unlikely to be false positives. Of the 27 not observed in the PDB, another 12 are supported at the superfamily level by additional types of experiments (genetic mutation rescue or physical experiments such as affinity chromatography). Only 15 superfamily-level interactions are supported solely by the relatively error-prone yeast two-hybrid method. Some assessments have shown that the two-hybrid technique, on which the data are partially based, is not completely reliable. Schwikowski et al. have shown (6) that, for function prediction, the data of Uetz et al. (7) have a reliability of 72%, meaning that approximately 28% of the interactions could be false positives. Given this estimated 28% false-positive rate for yeast two-hybrid data (6), we would expect 4-5 of the 15 to be false positives. Across the entire group of 55 superfamily-level interactions, this suggests an overall error rate of not more than 10% false positives in the yeast data.
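The counts in this paragraph can be checked with a short sketch. The function name is hypothetical, and the sketch assumes that interactions confirmed by the PDB or by additional experiments contribute no false positives, which simplifies the "highly unlikely" wording above.

```python
def yeast_false_positive_estimate(total=55, in_pdb=28, other_support=12,
                                  two_hybrid_fp_rate=0.28):
    """Return (two-hybrid-only count, expected false positives, overall rate).

    Assumption: only interactions supported solely by yeast two-hybrid
    data are treated as candidates for false positives.
    """
    two_hybrid_only = total - in_pdb - other_support
    expected_fp = two_hybrid_only * two_hybrid_fp_rate
    overall_rate = expected_fp / total
    return two_hybrid_only, expected_fp, overall_rate


only_y2h, expected_fp, overall = yeast_false_positive_estimate()
print(f"{only_y2h} two-hybrid-only interactions, "
      f"~{expected_fp:.1f} expected false positives, "
      f"overall rate ~{overall:.1%}")
```

With the figures quoted above, this gives 15 two-hybrid-only interactions and about 4.2 expected false positives, an overall rate of just under 8%. Rounding up to 5 false positives still keeps the rate below 10%.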

Of course, there are undoubtedly many false negatives, as we do not even have a complete map of all protein superfamilies and folds yet, so a complete understanding of all the interactions between superfamilies and folds is impossible at the present time. At the same time, however, this means that underestimates in the interaction counts (false negatives) are likely to be systematic errors, arising from the fact that our dataset represents a partial sample of the whole, as opposed to random experimental errors. Therefore, assuming our sample is approximately a random subset of the whole (a not entirely unreasonable assumption given the roughness of the error analysis) we would expect the final counts to maintain approximately the same ratios with respect to one another as in our current data. A cruder upper bound may be attempted by noting that the data follow a power law distribution.

## References

1. H. Hegyi and M. Gerstein, J Mol Biol 288, 147 (1999).

2. M. Gerstein and H. Hegyi, FEMS Microbiol Rev 22, 277 (1998).

3. R. J. Lipshutz, S. P. Fodor, T. R. Gingeras and D. J. Lockhart, Nat Genet 21, 20 (1999).

4. C. A. Wilson, J. Kreychman and M. Gerstein, J Mol Biol 297, 233 (2000).

5. W. G. Krebs and M. Gerstein, Nucleic Acids Res 28, 1665 (2000).

6. B. Schwikowski, P. Uetz and S. Fields, Nat Biotechnol 18, 1257 (2000).

7. P. Uetz et al., Nature 403, 623 (2000).

## Authors of this document

W. G. Krebs, Sarah A. Teichmann¹, Jiang Qian, and Mark Gerstein*

Department of Molecular Biophysics and Biochemistry
Yale University
PO Box 208114, New Haven, CT 06520, USA

¹Department of Biochemistry & Molecular Biology, University College London, Darwin Bldg, Gower St, London WC1E 6BT, UK

*To whom correspondence should be addressed. E-mail: Mark.Gerstein@Yale.Edu

## Reference for PartsList Database

Qian J, Stenger B, Wilson CA, Lin J, Jansen R, Teichmann SA, Park J, Krebs W, Yu H, Alexandrov V, Echols N, Gerstein M (2001), PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information, Nucleic Acids Res, in pre-publication.