Jodi Vanden Eng
MB&B 452a, Fall 2000
Controlling for Biases in Sequence Comparisons
The ability to extract information from
protein and DNA databases has improved dramatically over the past few
years. Computers are faster, comparison
algorithms are more advanced and effective, and databases have become larger,
more comprehensive, and more accessible.
Consequently, database analysis and bioinformatics have firmly
established their place in molecular biology, and are recognized as valuable
tools to aid in the understanding of biological systems and pathways [1].
Biological databases have become
extremely versatile and useful tools for molecular biologists and they differ
greatly in the size and complexity of information provided [1]. A wide spectrum
of information is represented across different databases including information
on DNA sequences, protein sequences, genome sequences, structural information,
and expression data (see Luscombe et al. for an overview of the sources of
information [1]). However, due to the
complex nature of the information provided in these databases, there are many
associated issues that must be addressed in order to fully understand the
effectiveness of the database searches and sequence comparisons [2]. As
described by Altschul et al., one must consider the choice of
scoring systems, the statistical significance of alignments, the masking of
uninformative or potentially confounding sequence regions, and database
redundancy and sequence repetitiveness [2].
The last of these issues, database
redundancy and sequence repetitiveness, can have severe implications in
comparison studies. The primary concern
is the degree to which these databases are biased by the disproportionate
representation of different organisms, sequences, and proteins. An article by Gerstein
addressed this question by looking at known structures in the Protein Data Bank
(PDB) and comparing them to proteins encoded by eight complete microbial
genomes [3]. This article
found differences in sequence length, composition and secondary structure
between the known structures and encoded proteins, demonstrating that potential
biases should be taken into account during analyses [3]. Another type of bias arises in structural
comparisons because many proteins consist of multiple domains [4]. The objective of this paper is to identify
the reasons for bias in the databases and to describe some methods from the
current literature that are used to control for it.
One of the most important issues in
doing a large-scale survey is avoiding biases.
There are several reasons why some types of sequences or structures may
be over-represented or under-represented in the databanks [4]. First, many of the sequences and structures
may be over-represented as a result of investigator preference. Moreover,
databases are often biased towards frequently studied, common, or
"model" organisms. Also, these databases are potentially biased
towards organisms of commercial relevance (e.g., those with
agricultural or medical implications). Conversely, particular sequences may
be under-represented simply because not all sequences are yet
known. In addition, particular sequences
or structures may be under-represented because of technical difficulty or
physical constraints (e.g., constraints on which organisms can be isolated or
which proteins are easily crystallized).
Several approaches to weighting data
have been developed in order to avoid misleading results due to the unequal
representation of sequences. One
approach, described by Altschul et al. [5],
uses a rooted evolutionary tree to assign weights to
individual observations and then generalizes the approach to the multiple
alignment problem. This method assumes
that a tree is either known or can be constructed.
The root of the tree is the point of interest, and observations on longer branches are
less reliable estimators of it. Using the
branching pattern of the tree, with the length of a given edge corresponding to
the amount of evolution that has occurred, a maximum likelihood estimate is used to
determine the weight of each organism.
A species receives lower weight when it is far from the root, or when it
has "close neighbors" in the tree [5].
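The tree-based weighting described above can be illustrated with a small sketch. It takes an electrical-network view of the tree: branch lengths act as resistances, a unit current enters at the root, and the current leaving each leaf is taken as that leaf's weight. For a model in which the covariance of two leaves equals their shared path length from the root, this agrees with the maximum likelihood weighting of the root value. Whether it reproduces every detail of [5] is an assumption. The nested-list tree encoding, the function names, and the example branch lengths are hypothetical, and all branch lengths are assumed to be strictly positive.

```python
# Sketch only: a tree node is either a leaf name (str) or a list of
# (child, branch_length) pairs. This encoding is an illustrative assumption.

def effective_resistance(node):
    """Resistance from this node down to its leaves (leaves act as ground)."""
    if isinstance(node, str):
        return 0.0
    # Child branches are in parallel; each is its own length in series
    # with the resistance of the subtree hanging below it.
    conductance = sum(
        1.0 / (length + effective_resistance(child)) for child, length in node
    )
    return 1.0 / conductance


def tree_weights(node, current=1.0, weights=None):
    """Split the incoming current among leaves; leaf currents are the weights."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = current
        return weights
    conductances = [
        1.0 / (length + effective_resistance(child)) for child, length in node
    ]
    total = sum(conductances)
    for (child, _), g in zip(node, conductances):
        tree_weights(child, current * g / total, weights)
    return weights


# Hypothetical example: A branches off alone; B and C are close neighbors.
example_tree = [("A", 2.0), ([("B", 1.0), ("C", 1.0)], 1.0)]
weights = tree_weights(example_tree)
for name in sorted(weights):
    print(name, round(weights[name], 4))
```

In this example all three leaves are the same distance from the root, yet B and C each receive a lower weight than A because they are close neighbors of one another.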
An alternative to the approach
described above is the method of Sibbald and Argos [6]. This method is preferred when the sequences
are not known to be phylogenetically related or when a tree cannot be produced without
distorting the distances between the sequences [7]. This approach weights each sequence
according to its Voronoi volume in sequence space. The "Voronoi" method takes into account distances to
other sequences, but not to the centroid.
Each sequence has a "Voronoi cell" attributed to it, which
consists of the set of points closer to this sequence than to any other. The more isolated the sequence is, the greater the volume of its cell. The volume is estimated by a Monte Carlo
algorithm that builds random sequences from the amino acids observed at each alignment
position. Each random sequence is assigned to the aligned sequence closest
to it; when n sequences are equally close, each is credited with
1/n [6]. In this method, a
species receives higher weight when it is rare or outlying because it conveys
more information about the root.
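The Monte Carlo estimate of Voronoi volumes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [6]: distance is taken to be the number of mismatched positions, each random residue is drawn uniformly from the distinct residues seen in that column, and the function name, sample count, seed, and toy alignment are all hypothetical.

```python
import random


def voronoi_weights(alignment, n_samples=5000, seed=0):
    """Estimate relative Voronoi volumes for equal-length aligned sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    length = len(alignment[0])
    # Residues observed in each alignment column (assumed sampling pool).
    columns = [sorted({seq[i] for seq in alignment}) for i in range(length)]
    credit = [0.0] * len(alignment)
    for _ in range(n_samples):
        probe = [rng.choice(col) for col in columns]
        # Mismatch count between the random sequence and each aligned sequence.
        dists = [sum(a != b for a, b in zip(seq, probe)) for seq in alignment]
        best = min(dists)
        nearest = [i for i, d in enumerate(dists) if d == best]
        # Ties among n nearest sequences: each receives a credit of 1/n.
        for i in nearest:
            credit[i] += 1.0 / len(nearest)
    total = sum(credit)
    return [c / total for c in credit]


# Hypothetical toy alignment: two near-duplicates and one outlier.
toy_alignment = ["AAAA", "AAAC", "GGGG"]
toy_weights = voronoi_weights(toy_alignment)
for seq, w in zip(toy_alignment, toy_weights):
    print(seq, round(w, 3))
```

The outlying sequence owns a larger share of the sampled space than either of the two near-duplicates, so it ends up with the largest weight.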
A third method of weighting aligned
biological sequences is outlined in Gerstein et al. [8]. The basis for this weighting scheme is to
use the distances between the sequences to cluster them in a bifurcating
tree. The tree is constructed by
averaging pairwise distances between sequences (measured by percentage
residue identity). To calculate weights, the authors sum the distances between nodes
(the points of bifurcation) in the tree. If an edge is shared between subtrees, the weight is
incremented for all of the sequences in each subtree. The process begins from the node closest to
100% identity (the leaves) and continues to the one closest to 0% (the root). The weight added to a subtree is the length
of the edge connecting it to the node of interest. The resulting weights are comparable to
those calculated by the method of Sibbald and Argos; the advantage of this
method is that it is conceptually simple and less computationally intensive
[8]. In this weighting scheme, low
weights are given to closely related sequences.
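The leaf-to-root accumulation described above can be sketched on a tree that has already been built. The text does not say how a shared edge is divided among the sequences beneath it; this sketch assumes it is split in proportion to the weight each sequence has accumulated so far, which is an assumption rather than a detail confirmed here. The nested-list tree encoding, the function name, and the example edge lengths are hypothetical, and the clustering step that builds the tree is omitted.

```python
# Sketch only: a tree node is either a leaf name (str) or a list of
# (child, edge_length) pairs. This encoding is an illustrative assumption.

def accumulate_weights(node):
    """Return {leaf: weight} for the subtree rooted at node, working upward."""
    if isinstance(node, str):
        return {node: 0.0}  # a leaf has no weight until its own edge is added
    weights = {}
    for child, length in node:
        sub = accumulate_weights(child)
        total = sum(sub.values())
        for leaf, w in sub.items():
            if total > 0:
                # Assumed rule: share the edge in proportion to current weight.
                share = length * w / total
            else:
                # No weight yet (single leaf): split the edge evenly.
                share = length / len(sub)
            weights[leaf] = w + share
    return weights


# Hypothetical example: B and C are closely related; A is more distant.
example_tree = [("A", 2.0), ([("B", 1.0), ("C", 1.0)], 1.0)]
raw = accumulate_weights(example_tree)
norm = sum(raw.values())
normalized = {leaf: w / norm for leaf, w in raw.items()}
for leaf in sorted(normalized):
    print(leaf, round(normalized[leaf], 3))
```

Here B and C divide the edge they share, so each ends with a lower weight than A, in line with the statement that closely related sequences receive low weights.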
The ideal way to overcome
biases in the database would be to wait until a complete database has been
developed. Unfortunately, this may take
more time than biologists can afford (one estimate is that not all
structures will be known until 2050 [4]).
In the meantime, sequence and/or structural comparisons and other
prediction analyses will have to rely on methods to control for potential
biases in the database such as those described above. However, these methods still rely on several assumptions. First, they assume that every type of
protein or sequence has at least one "representative" branch in the
database. In addition, they assume that
all proteins and sequences are equally represented in nature (so the weighting may
overcompensate when some are genuinely more common).
1. Luscombe, N.M., D. Greenbaum, and M. Gerstein, What is Bioinformatics? An Introduction and Overview. IMIA Yearbook (in press), 2001.
2. Altschul, S.F., et al., Issues in searching molecular sequence databases. Nat Genet, 1994. 6(2): p. 119-29.
3. Gerstein, M., How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des, 1998. 3(6): p. 497-512.
4. Gerstein, M. and H. Hegyi, Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev, 1998. 22(4): p. 277-304.
5. Altschul, S.F., R.J. Carroll, and D.J. Lipman, Weights for data related by a tree. J Mol Biol, 1989. 207(4): p. 647-53.
6. Sibbald, P.R. and P. Argos, Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J Mol Biol, 1990. 216(4): p. 813-8.
7. Vingron, M. and P.R. Sibbald, Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc Natl Acad Sci U S A, 1993. 90(19): p. 8777-81.
8. Gerstein, M., E.L. Sonnhammer, and C. Chothia, Volume changes in protein evolution. J Mol Biol, 1994. 236(4): p. 1067-78.