Meg Sullivan

MBB 452a

12/10/99

The Role of Bioinformatics in Helping to Solve Many of the Mysteries of Evolution

Theodosius Dobzhansky once said, "Nothing in biology makes sense except in the light of evolution" (Colby, 1996). Evolution is a natural phenomenon reflected in the apparently purposeful nature of living creatures, and it helps explain the diversity and complexity of life. Although much time and effort have been spent on solving the mysteries of evolution, a great deal remains to be discovered. In the past, evolution was the province of paleontologists and naturalists. Advances in knowledge about the early stages of evolution, particularly human evolution, depended on researchers finding new specimens to analyze. Despite advances in archaeological technology and techniques, finding such specimens has become increasingly difficult. However, the discovery that mutations in DNA are the root of evolution means that molecular biologists can now aid in the study of evolution (Hodge, 1999). The growing field of genomics has allowed scientists to sequence the complete genomes of many organisms. These sequences can then be compared to other known sequences using sequence analysis, which involves drawing biological conclusions from the known order of monomers in proteins and DNA. By performing such analyses, biologists can help resolve many mysteries and uncertainties in the study of evolution.

Carrying out a sequence analysis requires an efficient way of storing the data so that it is accessible and so that genomes can be compared. Bioinformatics provides the necessary techniques and, as such, is a fast-growing field in the study of evolution. Using these techniques, a scientist can compare the sequence of one gene against all of the known gene sequences in the DNA databases. Three databases worldwide collaborate to pool the known DNA sequence entries: the European Molecular Biology Laboratory (EMBL), GenBank, and the DNA Databank of Japan (DDBJ). A gene sequence can be compared against these databases to test for similarity. This is a common and influential technique in bioinformatics. Unfortunately, to complete such a search in a reasonable amount of time, researchers must sacrifice a small amount of accuracy.

One of the most commonly used programs for comparing a gene sequence against a sequence database is FASTA. The method uses a heuristic algorithm with a "k-tup" parameter, making approximations in an attempt to find only the most significant alignments. The approach requires exact matches between the query and target sequence over a word of fixed length, set by the "k-tup" input, and these words are looked up using a hash. FASTA concentrates on the best-scoring region of alignment at the expense of other local areas of great similarity. Another common program is BLAST (Basic Local Alignment Search Tool), which uses a statistical model to compare the unknown sequence with the database and find the best local alignment. BLAST also uses hashes to find short matches and then extends each hit to either side until the score plateaus. There are variations of the BLAST method depending on the probe and database used. For example, BLAST 1 does not permit gaps in the alignments. The Smith-Waterman algorithm is a slower but more accurate method of sequence analysis, because it does not rely on such approximations.
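The seeding and extension ideas described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the actual FASTA or BLAST code: the function names, the scoring values (match, mismatch), and the "drop" stopping rule are all assumptions made for the example.

```python
from collections import defaultdict


def build_ktup_index(target, ktup):
    """Hash every word of length ktup in the target to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - ktup + 1):
        index[target[i:i + ktup]].append(i)
    return index


def find_seed_hits(query, target, ktup):
    """Return (query_pos, target_pos) pairs where a ktup-long word matches exactly."""
    index = build_ktup_index(target, ktup)
    hits = []
    for q in range(len(query) - ktup + 1):
        for t in index.get(query[q:q + ktup], []):
            hits.append((q, t))
    return hits


def extend_hit(query, target, q, t, ktup, match=1, mismatch=-1, drop=2):
    """Extend one seed hit in both directions without gaps.

    Extension on each side stops once the running score has fallen
    `drop` below the best score seen (an assumed, simplified rule).
    Returns (best_score, query_start, query_end), with query_end exclusive.
    """
    best = score = ktup * match
    left = right = 0  # how far the best extension reaches on each side

    # Extend to the right of the seed.
    step = 0
    while q + ktup + step < len(query) and t + ktup + step < len(target):
        same = query[q + ktup + step] == target[t + ktup + step]
        score += match if same else mismatch
        step += 1
        if score > best:
            best, right = score, step
        elif best - score >= drop:
            break

    # Extend to the left, starting again from the best score so far.
    score = best
    step = 0
    while q - step - 1 >= 0 and t - step - 1 >= 0:
        same = query[q - step - 1] == target[t - step - 1]
        score += match if same else mismatch
        step += 1
        if score > best:
            best, left = score, step
        elif best - score >= drop:
            break

    return best, q - left, q + ktup + right
```

For the toy query "GATTACA" and target "CCGATTACACC" with a k-tup of 3, find_seed_hits returns five exact word matches, and extending the first hit (query position 0, target position 2) covers the whole query with a score of 7. A larger k-tup gives fewer seeds and a faster but less sensitive search, which is the trade between speed and accuracy mentioned earlier.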

While many evolutionary scientists now focus on an organism's genome rather than its larger features, the fundamental principles of evolution remain the same. In other words, a greater similarity between two species implies a closer evolutionary relationship (Hodge, 1999). Therefore, scientists can draw a number of conclusions by carrying out sequence alignment. These include showing not only that particular organisms are related to one another, but also that they are not. One of the most publicized findings based on bioinformatics was that Neanderthals are not the direct ancestors of humans. The basic approach involves comparing two or more sequences so that the optimal alignment can be found. Matches and mismatches are scored using a scoring matrix, which varies depending on the degree of similarity needed. Dynamic programming provides an efficient way of aligning sequences. One of the best-known classes of scoring matrices is the PAM matrices. The PAM250 matrix can be used for evolutionary studies because it detects sequences that diverged a long time ago. The PAM250 matrix looks for approximately 25 percent sequence identity. It is important to understand that similarity in sequences does not necessarily prove an evolutionary relationship. However, a high sequence similarity does suggest divergence from a common ancestor (Alphey, 1997).
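The dynamic programming approach can be illustrated with a minimal sketch of the Smith-Waterman local alignment score. The scoring values below (a match of +2, a mismatch of -1, and a gap of -2) are assumptions chosen for the example. For protein sequences, a real analysis would pass a lookup into a matrix such as PAM250 as the score function instead of the simple match/mismatch rule used here.

```python
def simple_score(x, y):
    """Assumed toy scoring rule: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def smith_waterman_score(a, b, score=simple_score, gap=-2):
    """Return the best local alignment score between sequences a and b.

    H[i][j] holds the best score of any local alignment that ends at
    a[i-1] and b[j-1]; a cell never drops below zero, which is what
    allows the alignment to start anywhere in either sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                           # start a new alignment
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                H[i - 1][j] + gap,                            # gap in b
                H[i][j - 1] + gap,                            # gap in a
            )
            best = max(best, H[i][j])
    return best
```

Because every cell of the table is filled in, the running time grows with the product of the two sequence lengths. This is why the method is slower than the hash-based database searches, although it is guaranteed to find the best-scoring local alignment under the chosen scoring rule.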

The recent availability of many different fully sequenced genomes also helps in phylogenetic analysis. The evolutionary history of groups of genes with similar functions can now be studied (Lake et al., 1999). Previously, the universal tree of life was drawn by studying the 16S-like rRNA genes. This approach, however, is problematic with regard to some genes, such as the genes that encode metabolic enzymes (Tekaia et al., 1999). The available genomes may be able to solve these problems. The genome-based approach involves using the methods of sequence alignment described above to determine specific homologies. The next step is the construction of a mathematical model that conveys the evolution in time of the two sequences. One such model is the Markov chain model of evolution. This model is specified by the branching order of the tree, the initial state at the common ancestor, and a transition matrix for each branch. The transition matrix represents the "proportion of ORFs (open reading frames) in organism j that have at least one similar ORF in organism i (Tij)" (Tekaia et al., 1999). The form of the constraints on the transition matrix determines the character evolution. Using the matrices, the evolutionary distances between organisms are determined. Once the data is collected, a researcher can draw a phylogenetic tree representing the common ancestry, or lack thereof, of different genes.
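The quoted definition of Tij can be made concrete with a small sketch. The input format, the function names, and especially the distance formula below are assumptions made for illustration; they are not taken from Tekaia et al. and should not be read as the published method.

```python
def transition_matrix(hits):
    """Compute T[(i, j)]: the proportion of ORFs in organism j that have
    at least one similar ORF in organism i.

    `hits` is an assumed input format: hits[j][orf] is the set of other
    organisms in which a similar ORF was found for that ORF of organism j.
    """
    organisms = sorted(hits)
    T = {}
    for i in organisms:
        for j in organisms:
            if i == j:
                T[(i, j)] = 1.0  # every ORF is similar to itself
            else:
                orfs = hits[j]
                shared = sum(1 for found_in in orfs.values() if i in found_in)
                T[(i, j)] = shared / len(orfs)
    return T


def distance(T, i, j):
    """Assumed illustrative distance: one minus the mean of T_ij and T_ji."""
    return 1.0 - (T[(i, j)] + T[(j, i)]) / 2.0


# Hypothetical toy data for three organisms A, B, and C.
toy_hits = {
    "A": {"a1": {"B"}, "a2": {"B", "C"}, "a3": set(), "a4": {"B"}},
    "B": {"b1": {"A"}, "b2": {"A"}},
    "C": {"c1": {"A"}, "c2": set()},
}
T = transition_matrix(toy_hits)
```

In this made-up data set, three of the four ORFs in organism A have a similar ORF in organism B, so T[("B", "A")] is 0.75. Note that the matrix is not symmetric, since the proportion depends on which genome is used as the reference. Pairwise distances computed this way could then be handed to a standard tree-building method to draw the phylogenetic tree.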

Bioinformatics is still a relatively new discipline. However, with the growing number of fully sequenced genomes and the increasing capability of computers and databases, bioinformatics will surely prove an important and powerful tool in molecular biology. One area where this is already becoming apparent is evolution. Programs such as FASTA and BLAST allow a gene to be compared against the entire known database of genes. Although some accuracy is sacrificed to save time, the approach is still very useful. Moreover, since scientists can only extract DNA from fossils that are no more than 100,000 years old, bioinformatics provides a tool for analyzing changes that occurred earlier in the evolutionary process. Scientists are now using this tool to develop models of how some of the earliest forms of life evolved (Kyrpides and Ouzounis, 1999). Sequence alignment, using dynamic programming, is also useful in establishing similarity between sequences and hence relationships among organisms. This similarity suggests, but does not entirely prove, homology between genes. Finally, bioinformatics can be used to construct phylogenetic trees that provide a visual representation of evolutionary relationships. As more and more genomes are sequenced, bioinformatics can only become more helpful in solving many of the mysteries of evolution.

 

 

 

References:

Alphey, Luke. DNA Sequencing: From Experimental Methods to Bioinformatics. New York: Springer, 1997.

Colby, Chris. "Introduction to Evolutionary Biology." Talk.Origins Archive. January 7, 1999.

Hodge, Russ. "Armchair Evolution." Bioinformer–Press Release. July 20, 1999.

Kyrpides, Nicos C. and Christos A. Ouzounis. "Transcription in Archaea." Proceedings of the National Academy of Sciences. July 20, 1999.

Lake, James A., Ravi Jain, and Maria C. Rivera. "Genomics: Mix and Match in the Tree of Life." Science 283. March 26, 1999.

Tekaia, Fredj, Antonio Lazcano, and Bernard Dujon. "The Genomic Tree as Revealed from Whole Proteome Comparisons." Genome Research. June 1999.