THE EVOLUTION OF PHYLOGENETIC CLASSIFICATION:
From 16S rRNA to the Genomic Tree
Ann L. Miller
Department of Molecular Biophysics and Biochemistry, New Haven, CT
Classifying organisms into an ordered scheme has long been a goal for scientists. Aristotle viewed life as a continuum, from lifeless forms through the complexity of plants and animals (8). In the 19th century, Darwin's theory of evolution suggested that the ordering principle is shared descent from common ancestors (2). Classification entered a time of stasis from Darwin until the mid-20th century because organisms could be classified only by phenotype. With the advent of molecular techniques, gene sequencing, and genome projects, whole new methods of phylogenetic classification were born and have evolved extensively in the past ten years. This paper reviews some of the major schemes of molecular phylogenetic classification.
THE UNIVERSAL TREE OF LIFE: 16S rRNA
Carl Woese began assembling 16S rRNA data in the mid-1970s; the current Universal Tree of Life rests on those data (2). rRNA was chosen as the molecule for comparison because of its "ancient and essential function in the cellular economy and its interactions with many (well over 100) other coevolved cellular RNAs and proteins" (2). Every cell has rRNA, its sequences accumulate mutations slowly, and rRNA genes are unlikely to have experienced lateral gene transfer (LGT) (2). The rRNA trees are created by sequence comparisons of small subunit rRNA genes. The rRNA data led Woese to propose that Archaea are the third domain of life, joining Bacteria and Eukarya (11), a proposal that has come under attack recently (9). Woese contends that the three domains are unique (11), while one of his critics, Ernst Mayr, says Woese places too much importance on the Archaea and that the differences between Bacteria and Archaea are not sufficient to warrant placing them in separate domains (6). Although the position of the Archaea in the Universal Tree is still debated, most agree to use the three-domain classification determined by the rRNA tree as the standard for comparison with other trees (3).
MULTIPLE PROTEIN COMPARISON TREES
The 16S rRNA tree is not an organismal phylogenetic tree; it is a gene tree (10). To move towards organismal phylogeny, scientists began creating trees based on protein sequences. In many cases, these phylogenies do confirm the rRNA tree, but no one consistent phylogeny has emerged. Other approaches, like that of Russell Doolittle (3), have used multiple protein comparisons to form trees. Doolittle aimed to assign not only evolutionary distances but also divergence times between organisms (1). The trees were constructed with 64 sets of enzymes that represent species from all three domains. First, the evolutionary distances were calculated in pair-wise fashion using the progressive method in conjunction with the BLOSUM-62 matrix (3). Then, a "distance clock" was calibrated according to the vertebrate fossil record, and extrapolations estimated the divergence times of the enzyme groups (3). Doolittle's study raises some intriguing questions related to the controversial issue of LGT. His results indicate that LGT has played a pivotal role for many genes. This finding also sheds light on the discrepancies seen among trees constructed with single proteins; differences among phylogenies based on different genes are likely due to LGT (9).
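The idea of a "distance clock" can be illustrated with a short sketch. It assumes a strictly linear relationship between evolutionary distance and time, fitted through the origin by least squares. The calibration points, the function names, and the linear model itself are assumptions made for illustration; they are not Doolittle's data or his actual procedure.

```python
# Hypothetical sketch of a linear "distance clock".
# All numbers and names are illustrative assumptions, not published values.

def calibrate_clock(calibration):
    """Fit distance = rate * time through the origin by least squares.

    calibration: list of (divergence_time_myr, distance) pairs, for example
    splits dated from a fossil record. Returns the rate in distance units
    per million years.
    """
    numerator = sum(t * d for t, d in calibration)
    denominator = sum(t * t for t, _ in calibration)
    if denominator == 0:
        raise ValueError("calibration needs at least one nonzero time")
    return numerator / denominator


def estimate_divergence_time(distance, rate):
    """Extrapolate a divergence time (Myr) from a distance and a fitted rate."""
    return distance / rate


# Made-up calibration points: (time in Myr, evolutionary distance).
calibration_points = [(100.0, 0.10), (200.0, 0.20), (400.0, 0.40)]
rate = calibrate_clock(calibration_points)

# Extrapolate far beyond the calibrated range, as a clock for deep splits must.
deep_split = estimate_divergence_time(2.0, rate)
```

Because the deep divergence times lie far outside the calibrated range, any error in the fitted rate, or any departure from linearity, is magnified in the extrapolated dates.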
THE SHARED GENE TREE
With the availability of completely sequenced genomes comes the possibility of constructing phylogenetic trees that are more representative of organisms than individual genes or sets of genes. Berend Snel recently reported the Shared Gene Tree, a distance-based phylogeny that is constructed on the basis of gene content, not sequence identity (7). The trees were constructed using 13 complete genomic sequences of unicellular organisms. First, similarity was computed by determining the number of genes two organisms have in common (similarity = # of genes in common / total # of genes) (7). Because organisms with larger genomes tend to share more genes, dividing by the total number of genes corrects for varying genome size. The phylogeny of the genomes was established using a neighbor-joining algorithm, and random subsets of the genes were used for bootstrapping (7). According to Snel, "The resulting tree reflects the standard phylogeny as based on 16S rRNA" (7). This was an important result because it independently reproduced the phylogeny obtained by sequence identity.
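The gene-content similarity described above can be sketched in a few lines. The formula as quoted does not say which total is meant, so this sketch assumes the gene count of the smaller genome of each pair. It also assumes that genes have already been grouped into shared families (the labels "g1", "g2", and so on are invented) and that distance is simply one minus similarity. The neighbor-joining step is omitted, and the bootstrap shown is only one plausible reading of "random subsets of the genes".

```python
import random
from itertools import combinations


def gene_content_similarity(genes_a, genes_b):
    """Shared genes divided by the size of the smaller genome (an assumption)."""
    total = min(len(genes_a), len(genes_b))
    if total == 0:
        return 0.0  # an empty gene set shares nothing
    return len(genes_a & genes_b) / total


def distance_matrix(genomes):
    """Return {(name_a, name_b): 1 - similarity} for every pair of genomes."""
    names = sorted(genomes)
    dist = {(n, n): 0.0 for n in names}
    for a, b in combinations(names, 2):
        d = 1.0 - gene_content_similarity(genomes[a], genomes[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


def bootstrap_distances(genomes, n_replicates, fraction=0.5, seed=0):
    """Recompute the distance matrix on random subsets of the gene families."""
    rng = random.Random(seed)
    all_genes = sorted(set().union(*genomes.values()))
    k = max(1, int(len(all_genes) * fraction))
    replicates = []
    for _ in range(n_replicates):
        subset = set(rng.sample(all_genes, k))
        sampled = {name: genes & subset for name, genes in genomes.items()}
        replicates.append(distance_matrix(sampled))
    return replicates


# Invented toy genomes, each a set of gene-family labels.
genomes = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g5"},
    "C": {"g6", "g7"},
}
dist = distance_matrix(genomes)
replicates = bootstrap_distances(genomes, n_replicates=5)
```

In this toy data, A and B share two genes and the smaller genome has three, so their similarity is 2/3; C shares nothing with either and sits at the maximum distance of 1. A tree-building method such as neighbor joining would take `dist` as input, and the replicate matrices would indicate how stable each grouping is.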
WHOLE PROTEOME GENOMIC TREES
In a complementary approach to Snel's, Bernard Dujon recently presented the Genomic Tree (9). This approach also offers an integrative look at genome evolution, taking into account genome content, loss or acquisition of genes, and overall genome redundancy. In this scheme, 20 completely sequenced genomes, as well as the partial sequences of S. pombe, H. sapiens, and M. musculus, were compared (9). The Genomic Tree was created by whole proteome comparisons. Functional identity was not taken into account; instead, the tree is based on the presence or absence of genes of common ancestry. Specifically, the full set of predicted ORFs was compared with BLASTP using the PAM250 matrix. Each gene product was used as a query sequence for comparison with its own proteome and with that of each other organism. The similarity of each ORF to any other was defined. Whole organisms were then compared pair-wise to determine how many ORFs were similar, and a value was assigned for the similarity of one organism to another. Finally, a matrix of pair-wise comparisons was constructed, and distances between organisms were calculated from the matrix (9). The Genomic Tree shows strong similarity to the 16S rRNA tree, again indicating that classification can be examined at a genomic level and that this type of approach can aid determination of evolutionary descent (9).
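A minimal sketch of how pair-wise ORF comparisons might be condensed into a matrix of organism distances is given below. The text does not specify how the similarity value was assigned, so every choice here is an assumption: the hit records are invented stand-ins for parsed BLASTP results, the score cutoff is arbitrary, the two directions of each comparison are averaged, and hits within an organism's own proteome (which bear on redundancy rather than on distance) are ignored.

```python
from collections import defaultdict


def similar_orf_fractions(hits, proteome_sizes, min_score=50.0):
    """Fraction of each organism's ORFs with a similar ORF in another organism.

    hits: iterable of (query_org, query_orf, target_org, score) records.
    proteome_sizes: {organism: number of predicted ORFs}, all nonzero.
    A query ORF counts once per target organism, however many hits it has.
    """
    matched = defaultdict(set)
    for query_org, query_orf, target_org, score in hits:
        if query_org != target_org and score >= min_score:
            matched[(query_org, target_org)].add(query_orf)
    orgs = sorted(proteome_sizes)
    return {
        (q, t): len(matched[(q, t)]) / proteome_sizes[q]
        for q in orgs
        for t in orgs
        if q != t
    }


def pairwise_distances(fractions):
    """Average both directions of each comparison and convert to a distance."""
    distances = {}
    for (q, t), frac in fractions.items():
        symmetric = (frac + fractions[(t, q)]) / 2
        distances[(q, t)] = 1.0 - symmetric
    return distances


# Invented hit records for two toy proteomes of four ORFs each.
hits = [
    ("X", "x1", "Y", 120.0),
    ("X", "x1", "Y", 80.0),   # second hit for x1, counted once
    ("X", "x2", "Y", 30.0),   # below the cutoff, ignored
    ("Y", "y1", "X", 110.0),
    ("Y", "y2", "X", 95.0),
    ("X", "x1", "X", 300.0),  # hit within the same proteome, ignored here
]
proteome_sizes = {"X": 4, "Y": 4}

fractions = similar_orf_fractions(hits, proteome_sizes)
distances = pairwise_distances(fractions)
```

The raw fractions are asymmetric (one of four ORFs in X finds a match in Y, while two of four in Y find a match in X), which is why some symmetrizing step is needed before the values can serve as distances.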
Clearly, it is necessary for the Universal Tree of Life to reflect evolutionary relationships of whole organisms, not just single genes. Phylogenetic classification has evolved with the advent of molecular techniques and genome projects. However, analysis on a genomic scale will not be possible until more genomic sequences are available. One issue that must be resolved is the debate over how much of a role LGT plays. If LGT does play a major role in evolution, perhaps hierarchical classification methods that consider only vertical evolution are not even the proper way to go about classifying organisms. As more genomes become available, the field of phylogenetic classification will continue to evolve, and a better understanding of evolution will be reached.
Table 1. Comparison of the four types of trees discussed in this review.
Type of phylogenetic tree | Popularized by | Year published | # of genes compared per organism | Need complete genomic sequence? | Comparisons based on | Trees indicate
16S rRNA | Carl Woese | 1987 | 1 | no | Gene sequence identity | Evolutionary descent (gene → organism)
Multiple Protein Comparison | Russell Doolittle | 1996 | 64 | no | Amino acid sequence identity | Evolutionary distance and divergence time
Shared Gene Tree | Berend Snel | 1999 | all | yes | Gene content | Evolutionary descent (genome)
Genomic Tree | Bernard Dujon | 1999 | all | yes | Gene content | Hierarchical classification of genomes (phenogram, not phylogenetic tree)
REFERENCES
Outtakes
In the 18th century, Linnaeus developed the binomial system of species classification, a system which is still used today (8).
Further, Mayr believes that having three domains overlooks the extreme diversity of the eukaryotes and that, as a matter of balance, the two prokaryotic domains should be one. He states, "To sweep all this [eukaryotic diversity] under the rug and claim that the difference between the two kinds of bacteria is of the same weight as the difference between the prokaryotes and the extraordinary world of the eukaryotes strikes me as incomprehensible" (6).
Likely, many Archaea have not even been discovered yet, and much more will be uncovered as more Archaeal genomes are sequenced and compared.
The original paper was quite controversial because it dated the last common ancestor of bacteria and eukaryotes to a little over 2 billion years ago; many felt that this time was much too short and that, according to the fossil record, it should be closer to 3.5 billion years (1). A second paper corrected for two major sources of error by using Grishin's formula, which corrects for amino acid interchange and variation in substitution rate, and by better representing Archaea in the comparison (1).
Lake is particularly interested in recent results that have demonstrated that informational genes (transcription, translation) are rarely transferred, while operational genes (housekeeping) undergo extensive lateral transfer, suggesting that LGT is not a random event (4). This could explain why rRNA (informational) is a good molecule to base phylogenies upon, while some other genes (operational) show inconsistent results.
James Lake suggests that science has relied too heavily on rRNA and that, as a result, LGT has been largely overlooked until now (4, 5).