Jodi Vanden Eng

MB&B 452a

Fall 2000

Controlling for Biases in Sequence Comparisons

The ability to extract information from protein and DNA databases has improved dramatically over the past few years.  Computers are faster, comparison algorithms are more advanced and effective, and databases have become larger, more comprehensive, and more accessible.  Consequently, database analysis and bioinformatics have firmly established their place in molecular biology, and are recognized as valuable tools to aid in the understanding of biological systems and pathways [1].

Biological databases have become versatile and useful tools for molecular biologists, and they differ greatly in the size and complexity of the information they provide [1]. A wide spectrum of information is represented across different databases, including DNA sequences, protein sequences, genome sequences, structural information, and expression data (see Luscombe et al. for an overview of the sources of information [1]). However, because the information in these databases is complex, several issues must be addressed in order to judge the effectiveness of database searches and sequence comparisons [2]. As described by Altschul et al., one must take into consideration the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, and database redundancy and sequence repetitiveness [2].

The last of these issues, database redundancy and sequence repetitiveness, can have severe implications in comparison studies. The primary concern is the degree to which these databases are biased by the disproportionate representation of different organisms, sequences and proteins. An article by Gerstein addressed this question by comparing the known structures in the Protein Data Bank (PDB) with the proteins encoded by eight complete microbial genomes [3]. The article found differences in sequence length, composition and secondary structure between the known structures and the encoded proteins, demonstrating that potential biases should be taken into account during analyses [3]. Another type of bias arises when dealing with protein structure, because many proteins consist of multiple domains [4]. The objective of this paper is to identify the reasons for bias in the databases and to describe some methods from the current literature that are used to control for it.

One of the most important issues in doing a large-scale survey is avoiding biases. There are several reasons why some types of sequences or structures may be over-represented or under-represented in the databanks [4]. First, many sequences and structures may be over-represented as a result of investigator preference. Moreover, databases are often biased towards frequently studied "model" organisms. Databases are also potentially biased towards organisms of commercial relevance (e.g., those with agricultural or medical implications). Conversely, particular sequences may be under-represented simply because not all sequences are known yet. In addition, particular sequences or structures may be under-represented because of technical difficulty or physical constraints (e.g., constraints on which organisms can be isolated or which proteins are easily crystallized).

Several approaches to weighting data have been developed in order to avoid misleading results caused by the unequal representation of sequences. One approach, described by Altschul et al. [5], uses a rooted evolutionary tree to assign weights to individual observations and then generalizes the approach to the multiple alignment problem. This method assumes that a tree is either known or can be constructed. The root of the tree is the point of interest, and leaves on longer branches are treated as less reliable estimators of it. Using the branching pattern of the tree, with the length of each edge corresponding to the amount of evolution that has occurred along it, a maximum likelihood estimate is used to determine the weight of each organism. A species receives lower weight when it is far from the root, or when it has "close neighbors" in the tree [5].
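The behavior of tree-based weights can be illustrated with a small sketch. It is not a transcription of the procedure in [5]. It assumes that the uncertainty accumulated along an edge is proportional to its length, in which case the weights can be computed by treating the tree as a resistor network: each edge length is a resistance, a unit current enters at the root, and the current arriving at each leaf is its weight. The tree encoding, the function name tree_weights and the example edge lengths are all hypothetical.

```python
def tree_weights(tree):
    """Return {leaf_name: weight} for a rooted tree with positive edge lengths.

    Assumed encoding (hypothetical): a leaf is a string; an internal node is
    a list of (child, edge_length) pairs.
    """
    def resistance(node):
        # Effective resistance from this node down to its leaves.
        if isinstance(node, str):
            return 0.0
        return 1.0 / sum(1.0 / (length + resistance(child))
                         for child, length in node)

    weights = {}

    def distribute(node, current):
        # Split the incoming current among the children in proportion
        # to the conductance of each branch.
        if isinstance(node, str):
            weights[node] = current
            return
        conductances = [1.0 / (length + resistance(child))
                        for child, length in node]
        total = sum(conductances)
        for (child, _), g in zip(node, conductances):
            distribute(child, current * g / total)

    distribute(tree, 1.0)
    return weights


# A is alone on a branch of length 2; B and C are close neighbors
# at the same total distance from the root.
example_tree = [("A", 2.0), ([("B", 1.0), ("C", 1.0)], 1.0)]
weights = tree_weights(example_tree)
```

In this example all three leaves are equally far from the root, yet A receives 3/7 of the weight while B and C receive 2/7 each, which matches the statement that a species with close neighbors is down-weighted.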

An alternative to the approach described above is the method of Sibbald and Argos [6]. This method is preferred when the sequences are not known to be phylogenetically related, or when a tree cannot be produced without distorting the distances between the sequences [7]. The approach weights each sequence according to its Voronoi volume in sequence space. The "Voronoi" method takes into account the distances to other sequences, but not the distance to a centroid. Each sequence is assigned a "Voronoi cell", the set of points in sequence space that lie closer to it than to any other sequence. The more isolated the sequence is, the greater the volume of its cell. The volume is estimated by a Monte Carlo algorithm that builds random sequences from the amino acids found at each alignment position. For each random sequence, the closest sequence in the alignment is credited; when n sequences are equally close, each is credited 1/n [6]. In this method, a species receives higher weight when it is rare or outlying because it conveys more information about the root.
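The Monte Carlo estimate of Voronoi volumes can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published implementation: distance is taken to be the Hamming distance, random sequences are drawn uniformly from the residues observed in each column, and the function names, sample count and toy alignment are hypothetical.

```python
import random


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def voronoi_weights(alignment, n_samples=10000, seed=0):
    """Estimate the share of sequence space closest to each aligned sequence."""
    rng = random.Random(seed)
    # Residues observed at each alignment position.
    columns = [sorted(set(col)) for col in zip(*alignment)]
    counts = [0.0] * len(alignment)
    for _ in range(n_samples):
        # Build a random sequence, one observed residue per position.
        probe = [rng.choice(col) for col in columns]
        dists = [hamming(probe, seq) for seq in alignment]
        best = min(dists)
        nearest = [i for i, d in enumerate(dists) if d == best]
        # Ties share the credit equally (1/n each).
        for i in nearest:
            counts[i] += 1.0 / len(nearest)
    return [c / n_samples for c in counts]


# Two nearly identical sequences and one outlier.
weights = voronoi_weights(["AAAAAA", "AAAAAC", "GGGGGG"])
```

With this toy alignment the outlying third sequence collects the largest share of the random sequences, while the two near-duplicates split most of the remainder between them.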

A third method of weighting aligned biological sequences is outlined in Gerstein et al. [8]. The basis of this weighting scheme is to use the distances between the sequences to cluster them in a bifurcating tree. The tree is constructed by averaging pairwise distances between sequences (measured by percentage residue identity). To calculate weights, the edge lengths between nodes (the points of bifurcation in the tree) are added up. If an edge is shared by the sequences in a subtree, its length is used to increment the weights of all of the sequences in that subtree. The process begins at the nodes closest to 100% identity (the leaves) and continues to the one closest to 0% (the root). The weight added to a subtree is the length of the edge connecting it to the node of interest. The resulting weights are comparable to those calculated by the method of Sibbald and Argos; the advantage of this method is that it is conceptually simple and less computationally intensive [8]. In this weighting scheme, low weights are given to closely related sequences.
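The bottom-up weight calculation can be sketched as below, assuming the tree has already been built. Two details are assumptions that the description above does not spell out: the length of a shared edge is divided among the sequences beneath it in proportion to their current weights, and edge lengths are expressed as fractional sequence difference (1 minus fractional identity). The tree encoding, the function name gsc_style_weights and the example values are hypothetical.

```python
def gsc_style_weights(tree):
    """Return normalized {leaf_name: weight} for a rooted tree.

    Assumed encoding (hypothetical): a leaf is a string; an internal node is
    a list of (child, edge_length) pairs.
    """
    weights = {}

    def visit(node):
        # Returns the names of the leaves under this node.
        if isinstance(node, str):
            weights[node] = 0.0
            return [node]
        all_leaves = []
        for child, length in node:
            leaves = visit(child)
            total = sum(weights[name] for name in leaves)
            for name in leaves:
                if total > 0:
                    # Shared edge: split in proportion to current weights.
                    weights[name] += length * weights[name] / total
                else:
                    # Nothing accumulated yet (e.g. a single leaf):
                    # split the edge length equally.
                    weights[name] += length / len(leaves)
            all_leaves.extend(leaves)
        return all_leaves

    visit(tree)
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# B and C are 90% identical to their common node; A joins at the root.
example_tree = [("A", 0.8), ([("B", 0.1), ("C", 0.1)], 0.7)]
weights = gsc_style_weights(example_tree)
```

Here B and C each start with 0.1 and share the 0.7 edge equally, ending at 0.45 each against 0.8 for A before normalization, so the two closely related sequences individually receive less weight than the isolated one.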

Ideally, the way to overcome biases in the database would be to wait until a complete database has been developed. Unfortunately, this may take more time than biologists can afford (one estimate is that not all structures will be known until 2050 [4]). In the meantime, sequence and structural comparisons and other prediction analyses will have to rely on methods that control for potential biases in the database, such as those described above. However, these methods still rely on several assumptions. First, they assume that every type of protein or sequence has at least one "representative" branch in the database. In addition, they assume that all proteins and sequences are equally represented in nature, so they may overcompensate.



1.  Luscombe, N.M., D. Greenbaum, and M. Gerstein, What is Bioinformatics? An Introduction and Overview. IMIA yearbook (in press), 2001.

 

2.  Altschul, S.F., et al., Issues in searching molecular sequence databases. Nat Genet, 1994. 6(2): p. 119-29.

 

3.  Gerstein, M., How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des, 1998. 3(6): p. 497-512.

 

4.  Gerstein, M. and H. Hegyi, Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev, 1998. 22(4): p. 277-304.

 

5.  Altschul, S.F., R.J. Carroll, and D.J. Lipman, Weights for data related by a tree. J Mol Biol, 1989. 207(4): p. 647-53.

 

6.  Sibbald, P.R. and P. Argos, Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J Mol Biol, 1990. 216(4): p. 813-8.

 

7.  Vingron, M. and P.R. Sibbald, Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc Natl Acad Sci U S A, 1993. 90(19): p. 8777-81.

 

8.  Gerstein, M., E.L. Sonnhammer, and C. Chothia, Volume changes in protein evolution. J Mol Biol, 1994. 236(4): p. 1067-78.