Man Wah So
MBB 452a
Prof. Mark Gerstein
Final Project
An Overview of Three Approaches to Constructing Phylogeny Based on Entire Genomes
The advent of phylogenetic
analysis based on molecular sequence information has radically transformed our
perception of evolution. Based on
rRNA sequence comparisons, Woese proposed the addition of a new taxon,
“domain,” above the level of kingdom in the old five-kingdom taxonomy and
classified organisms into three domains (the Archaea, the Bacteria, and the
Eucarya) according to their differences at the molecular level (8). Although
Woese’s idea has generally been accepted, many trees built from individual
genes have yielded ambiguous results or even failed to support the
three-domain system (1, 2, 4, 5). To complement the
information obtained from traditional sequence comparisons of individual genes,
researchers have turned to entire genomes for phylogenetic
reconstruction. This paper describes three approaches, proposed by Sankoff,
Snel, and Tekaia over the past decade, for inferring phylogeny from organellar
and complete genomes. The availability of complete genome
sequences has not only provided alternative ways to assess Woese’s
three-domain proposal but has also enriched our understanding of evolution at
the molecular level.
In a 1992 paper, “Gene order
comparisons for phylogenetic inference: Evolution of the mitochondrial genome,”
Sankoff et al. investigated the feasibility of inferring evolutionary
relationships from macrostructures of entire genomes (6). Although evolutionary inference based on
molecular information has traditionally compared homologous versions of a
single gene in different organisms, such comparisons have been limited in that
they are based on point mutations only.
Rapid rates of nucleotide substitution often make it difficult to distinguish
true homology between related genes from background noise. By comparing gene
orders rather than individual sequences, Sankoff’s approach offers a way
to circumvent this problem.
At the genome level, chromosomal inversions, transpositions, insertions,
and deletions, rather than nucleotide substitutions, are the major events
that determine the evolutionary distance between organisms. In the study,
Sankoff therefore defined an evolutionary edit distance, E(a, b),
as the minimum number of elementary events (inversions, transpositions, and
deletions or insertions) necessary to change the gene order of one circular
genome a into that of another, b.
He then compiled a database of sixteen mitochondrial gene
orders from fungi and other eukaryotes and computed the evolutionary edit
distances between them (6).
Observing that trees based on gene order comparisons exhibit branching
orders that correspond almost perfectly to accepted evolutionary knowledge,
Sankoff concluded that macrostructures of genomes contain meaningful
information for phylogenetic reconstruction (6).
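To make the definition of E(a, b) concrete, the following Python sketch finds the smallest number of inversions and transpositions separating two small circular gene orders by breadth-first search. It is an illustration under simplifying assumptions, not the algorithm used by Sankoff et al.: both genomes are assumed to contain the same genes exactly once (so insertions and deletions never arise), gene orientation is ignored, every event costs one unit, and the function names are hypothetical. Exhaustive search of this kind is practical only for a handful of genes.

```python
from collections import deque


def canonical(order):
    """Return one representative for all rotations and reading directions
    of a circular gene order (assumes each gene appears exactly once)."""
    def rotate(seq):
        start = seq.index(min(seq))
        return seq[start:] + seq[:start]

    order = tuple(order)
    return min(rotate(order), rotate(order[::-1]))


def neighbors(order):
    """Yield every gene order one inversion or one transposition away."""
    n = len(order)
    # Inversion: reverse a segment of at least two genes.
    for i in range(n):
        for j in range(i + 2, n + 1):
            yield order[:i] + order[i:j][::-1] + order[j:]
    # Transposition: cut out a segment and reinsert it elsewhere.
    for i in range(n):
        for j in range(i + 1, n + 1):
            segment, rest = order[i:j], order[:i] + order[j:]
            for k in range(len(rest) + 1):
                if k != i:  # k == i would put the segment back unchanged
                    yield rest[:k] + segment + rest[k:]


def edit_distance(a, b):
    """Minimum number of inversions/transpositions turning a into b."""
    if sorted(a) != sorted(b):
        raise ValueError("this sketch requires identical gene content")
    start, target = canonical(a), canonical(b)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        order, dist = queue.popleft()
        if order == target:
            return dist
        for nxt in map(canonical, neighbors(order)):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))


print(edit_distance("ABCDE", "CDEAB"))  # prints 0: same circle, different start
print(edit_distance("ABCDE", "ACBDE"))  # prints 1: one inversion of B-C
```

Reducing every order to a canonical rotation and reading direction is what makes the search respect circularity: two lists that describe the same circle are treated as a single state.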
Similarly, to address the
inconsistencies observed among species phylogenies based on sequence comparisons
of individual genes, Snel turned to whole-genome trees in 1999. In their paper, “Genome phylogeny based on
gene content,” Snel et al. presented an integrative view of genome analysis
based on shared gene content and defined the similarity between two genomes as
the number of genes that they have in common divided by their total number of
genes (7). In this approach, lists of
pairs of homologous sequences were first compiled from a Smith-Waterman
comparison (at the amino acid level) of all the genes in the two genomes, using
an E-value cutoff of 0.01. Then, pairs
of genes that are each other’s “closest relative” in their respective genomes
were selected to determine the number of genes shared between two genomes
(7). Comparing protein sequences
encoded by thirteen completely sequenced genomes of unicellular species with
one another, Snel observed that the number of genes two genomes have in common
depends on their evolutionary distance (7).
When he derived a genome phylogeny from shared gene content, Snel found
that the tree reflected the standard phylogeny based on rRNA sequence identity
(7). The study suggests that in cases
where inconsistencies in single-gene trees are observed due to horizontal gene
transfer, phylogenetic analysis based on gene content may provide a more
representative view of the evolutionary differences among organisms at the
molecular level (7).
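The two steps described above (selecting genes that are each other's closest relative, then normalising the shared count) can be sketched in Python as follows. This is a minimal illustration, not the pipeline of Snel et al.: the input format, the function names, and the use of the lowest E-value to decide the "closest relative" are all assumptions. The phrase "their total number of genes" also leaves the exact denominator open; the sketch assumes the gene count of the smaller genome, so that a genome fully contained in a larger one scores 1.0.

```python
E_CUTOFF = 0.01  # E-value cutoff quoted in the text


def reciprocal_best_hits(hits, cutoff=E_CUTOFF):
    """Return gene pairs that are each other's best hit.

    hits: iterable of (gene_a, gene_b, e_value) tuples from an all-against-all
    comparison of genome A with genome B (hypothetical input format).
    """
    best_for_a, best_for_b = {}, {}
    for gene_a, gene_b, e_value in hits:
        if e_value > cutoff:
            continue  # not significant enough to count as homologous
        if gene_a not in best_for_a or e_value < best_for_a[gene_a][1]:
            best_for_a[gene_a] = (gene_b, e_value)
        if gene_b not in best_for_b or e_value < best_for_b[gene_b][1]:
            best_for_b[gene_b] = (gene_a, e_value)
    return {(a, b) for a, (b, _) in best_for_a.items()
            if best_for_b[b][0] == a}


def gene_content_similarity(n_shared, n_genes_a, n_genes_b):
    """Shared genes divided by the gene count of the smaller genome
    (an assumed normalisation, see the lead-in)."""
    return n_shared / min(n_genes_a, n_genes_b)


# Made-up hits between a 3-gene genome A and a 4-gene genome B.
hits = [
    ("a1", "b1", 1e-30),
    ("a1", "b2", 1e-5),
    ("a2", "b1", 1e-10),
    ("a2", "b2", 1e-20),
    ("a3", "b3", 0.5),   # above the cutoff, ignored
]
shared = reciprocal_best_hits(hits)
print(sorted(shared))  # prints [('a1', 'b1'), ('a2', 'b2')]
print(gene_content_similarity(len(shared), 3, 4))
```

Requiring the best hit to be reciprocal keeps a gene from being counted as shared merely because it resembles a member of a large gene family in the other genome.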
Whereas evolutionary descent
was the basis for both Sankoff’s and Snel’s studies, Tekaia instead attempted
to derive genome phylogeny from a hierarchical classification of genomes. In a 1999 paper, “The genomic tree as
revealed from whole proteome comparisons,” Tekaia et al. constructed genomic
trees from twenty completely sequenced genomes through whole proteome
comparisons, taking into account the predicted gene product content of each
organism and their overall similarity (9).
First, the full set of predicted gene products of a completely sequenced
organism was compared with itself and with that of every other organism. Then, the proportion of ORFs in organism j that have at least one similar ORF in
organism i, Tij, was determined for all possible pairs of the n organisms to generate an n × n
matrix. Using correspondence
analysis on the matrix of Tij values,
distances between organisms were subsequently calculated and used for the
construction of genomic trees (9).
Although Tekaia’s trees were essentially phenograms, he named them
genomic trees due to their resemblance to Woese’s and other sequence-based
phylogenies. The agreement between
Tekaia’s trees, which reflect sequence divergence as well as gene acquisition
and loss, and Woese’s rRNA-based tree, which is based solely on sequence
divergence, suggests that the duplication and deletion events that have
taken place through evolutionary time are, on average, statistically similar
in related organisms (9).
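As an illustration of the first step, the Python sketch below builds the n × n table of Tij values from precomputed similarity results. The input format and function name are hypothetical, and the sketch stops at the matrix: the correspondence analysis and tree construction that Tekaia et al. applied afterwards are not shown.

```python
def similarity_matrix(orf_counts, matched_orfs):
    """Build the table of T[i][j] values.

    orf_counts:   {organism: number of predicted ORFs}
    matched_orfs: {(i, j): set of ORFs of organism j that have at least one
                   similar ORF in organism i}  (hypothetical input format)
    """
    organisms = sorted(orf_counts)
    matrix = [
        [len(matched_orfs.get((i, j), set())) / orf_counts[j]
         for j in organisms]
        for i in organisms
    ]
    return organisms, matrix


# Made-up example with two organisms, X (4 ORFs) and Y (5 ORFs).
orf_counts = {"X": 4, "Y": 5}
matched_orfs = {
    ("X", "X"): {"x1", "x2"},        # ORFs of X with a similar ORF in X
    ("X", "Y"): {"y1", "y2", "y3"},  # ORFs of Y with a similar ORF in X
    ("Y", "X"): {"x1", "x2", "x3"},  # ORFs of X with a similar ORF in Y
    ("Y", "Y"): {"y1"},
}
organisms, T = similarity_matrix(orf_counts, matched_orfs)
print(organisms)  # prints ['X', 'Y']
print(T)          # prints [[0.5, 0.6], [0.75, 0.2]]
```

Because each entry is normalised by the ORF count of organism j, the matrix is generally not symmetric: T for (X, Y) and T for (Y, X) differ in the example.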
Since Woese’s proposal for
the three-domain classification scheme, the importance of understanding
differences among organisms at the molecular level in the study of evolution
has been increasingly recognized. Where
single-gene comparisons have failed, researchers have turned to comparisons of
entire genomes for evolutionary inference; in turn, these studies are used to
assess the validity of previously derived phylogenies. Although all three
studies discussed in this paper produced results
that support Woese’s rRNA-based phylogeny, his three-domain
proposal has yet to be confirmed. In
particular, whole-genome phylogenies to date have been based on only a limited
number of available genome sequences, and hence their results may not be
representative of the whole picture.
Only when more genome sequences are available will we be able to refine
details of phylogenetic trees and properly assess the evolutionary
relationships among organisms on this planet.
References:
1. Brown, J.R., and W.F. Doolittle. 1995. Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc. Natl. Acad. Sci. 92: 2441-2445.
2. Cavalier-Smith, T. 1989. Molecular phylogeny. Archaebacteria and Archezoa. Nature 339: 100-101.
3. Doolittle, R.F. 1998. Microbial genomes opened up. Nature 392: 339-342.
4. Forterre, P., N. Benachenhou-Lahfa, F. Confalonieri, M. Duguet, C. Elie, and B. Labedan. 1992. The nature of the last universal ancestor and the root of the tree of life, still open questions. Biosystems 28: 15-32.
5. Gupta, R.S. 1998. Protein phylogenies and signature sequences: A reappraisal of evolutionary relationships among Archaebacteria, Eubacteria, and Eukaryotes. Microbiol. Mol. Biol. Rev. 62: 1435-1491.
6. Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B.F. Lang, and R. Cedergren. 1992. Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. 89: 6575-6579.
7. Snel, B., P. Bork, and M.A. Huynen. 1999. Genome phylogeny based on gene content. Nat. Genet. 21: 108-110.
8. Woese, C.R. 1990. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. 87: 4576-4579.
9. Tekaia, F., A. Lazcano, and B. Dujon. 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9: 550-557.