Genome Sequencing: A History

A History of Genome Sequencing:

The sequencing of the human genome, along with those of related organisms, represents one of the largest scientific endeavors in the history of mankind. The information garnered from sequencing will provide the raw data for the exploding field of bioinformatics, where computer science and biology live in symbiotic harmony. The large-scale sequencing proposed by the Human Genome Project in 1990 could never have become a reality without modern computer facilities. A mere twenty years ago, computers would have been powerless in the face of such a daunting amount and array of data. Identifying homologues and characterizing genomes of millions of nucleotides across organisms was unimaginable until the rapid advancement of microchips and processors over the past two decades. In addition, the first sequenced genome of a free-living organism, Haemophilus influenzae, would have been impossible without the computational methods developed at the facilities of The Institute for Genomic Research (TIGR). While this is not technically an aspect of bioinformatics, this development would have been impossible with the computers of yesterday. The short history of genome sequencing began with Frederick Sanger's invention of his sequencing method almost twenty-five years ago.

The art of determining the sequence of DNA is known as Sanger sequencing, after its brilliant pioneer. This technique involves the separation of fluorescently labeled DNA fragments according to their length by polyacrylamide gel electrophoresis (PAGE). The base at the end of each fragment can then be visualized and identified by the dye with which it is labeled. The time- and labor-intensive nature of preparing and running gels, as well as the large amounts of sample required, increases the time and cost of genomic sequencing. These conditions drastically reduce the efficiency of sequencing projects, ultimately limiting what researchers can attempt.
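The read-out step described above can be pictured with a toy model. The sketch below is only an illustration under simplifying assumptions: every fragment is reduced to a pair of (length, terminal base), the data are invented, and the names read_gel and the sample fragments are hypothetical rather than part of any real sequencing software.

```python
import random


def read_gel(fragments):
    """Read a sequence from (length, terminal_base) pairs, shortest fragment first."""
    if not fragments:
        return ""
    # Electrophoresis orders fragments by length; sorting plays that role here.
    ordered = sorted(fragments, key=lambda frag: frag[0])
    lengths = [length for length, _ in ordered]
    # A readable lane needs exactly one fragment for each consecutive length.
    expected = list(range(lengths[0], lengths[0] + len(lengths)))
    if lengths != expected:
        raise ValueError("fragment lengths must be consecutive and unique")
    # The labeled base at the end of each fragment spells out the sequence.
    return "".join(base for _, base in ordered)


if __name__ == "__main__":
    # Invented example: one terminated fragment for every position of GATTACA.
    fragments = [(i + 1, base) for i, base in enumerate("GATTACA")]
    random.shuffle(fragments)  # the fragments start out mixed together
    print(read_gel(fragments))
```

A missing or duplicated fragment length raises an error in this sketch, which loosely mirrors how an unreadable band interrupts a real read.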

Bacteriophage φX174, a viral genome of only 5,386 base pairs (bp), was the first genome to be sequenced. Frederick Sanger, in another revolutionary discovery, invented the method of "shotgun" sequencing, a strategy based on breaking the genome into random pieces of DNA that are cloned and sequenced individually. The sequenced pieces are then assembled by their overlapping regions to form contiguous sequences (otherwise known as contigs). The final step involves the use of custom primers to sequence across the gaps between the contigs, thus giving the completely sequenced genome. Sanger first used "shotgun" sequencing five years later to complete the sequence of bacteriophage λ, which was significantly larger at 48,502 bp. This method allowed sequencing projects to proceed at a much faster rate, thus expanding the scope of realistic sequencing ventures. Since then, several other viral and organellar genomes have been sequenced using similar techniques, such as the 229 kb genome of cytomegalovirus (CMV), the 192 kb genome of vaccinia, the 187 kb mitochondrial and 121 kb chloroplast genomes of Marchantia polymorpha, and the 186 kb genome of smallpox.
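The idea of assembling pieces by their overlapping regions can be sketched in a few lines. What follows is a minimal greedy overlap merger, offered as an illustration only: it is not Sanger's procedure or any published assembler, it assumes error-free reads from a single strand, and the function names and the 15-base example sequence are invented for the purpose.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that lie wholly inside another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # whatever is left are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Invented example: three overlapping reads from a 15-base sequence.
    reads = ["GCAATCCG", "ATGGCGTG", "GCGTGCAAT"]
    print(greedy_assemble(reads))
```

When the reads do not overlap at all, the function returns more than one contig, which corresponds to the gaps that the custom primers are later used to close. Real genomes contain repeats and sequencing errors that defeat a naive greedy merge like this one.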

The success with viral genome sequencing stemmed from the relatively small size of viral genomes. In 1989, Andre Goffeau set up a European consortium to sequence the genome of the budding yeast Saccharomyces cerevisiae (12.5 Mb). Goffeau's European collaboration involved 74 different laboratories, drawn to the project in hopes of sequencing the homologs of their favorite genes. Most laboratories utilized Sanger's "shotgun" method of sequencing, which had become the accepted standard for genome sequencing. The S. cerevisiae genome was approximately 60 times larger than any sequence previously attempted, which explains why Goffeau felt compelled to invite the cooperation of a group of laboratories. At the time, the sequencing of model organisms such as S. cerevisiae appeared to be the logical step towards the eventual characterization of the human genome, a task that seemed beyond the scope of technology due to its tremendous size of 3,000 Mb. Sequencing smaller genomes would highlight the problems with sequencing techniques, eventually refining the technology to be used on large-scale projects like H. sapiens. In addition, valuable insight concerning these organisms would be gained with the elucidation of their genetic makeup.

The following year saw the initiation of a plethora of ambitious sequencing proposals, the foremost being the introduction of the Human Genome Project in 1990. The U.S. Human Genome Project (HGP) is a joint effort of the Department of Energy and the National Institutes of Health that was designed as a three-step program to produce genetic maps, physical maps, and finally the complete nucleotide sequence map of the human chromosomes. The first two aims of the project are practically fulfilled, and the majority of work is now concentrated on the exact nucleotide sequence of the human genome. In the wake of this pronouncement came the start of three projects aimed at elucidating the sequences of smaller model organisms similar to S. cerevisiae in their academic utility: Escherichia coli, Mycoplasma capricolum, and Caenorhabditis elegans. It was hoped that these projects would increase the efficiency of sequencing, but unfortunately they fell short of this task. Many anticipated that E. coli would be the first genome to be sequenced entirely, but to the shock of the science community, an outsider won the race for the first complete genome sequence of a free-living organism: Haemophilus influenzae.

A team headed by J. Craig Venter of The Institute for Genomic Research (TIGR) and Nobel laureate Hamilton Smith of Johns Hopkins University sequenced the 1.8 Mb bacterium with new computational methods developed at TIGR's facility in Gaithersburg, Maryland. Previous sequencing projects had been limited by the lack of adequate computational approaches to assemble the large number of random sequences produced by "shotgun" sequencing. In conventional sequencing, the genome is broken down laboriously into ordered, overlapping segments, each containing up to 40 kb of DNA. These segments are "shotgunned" into smaller pieces and then sequenced to reconstruct the genome. Venter's team utilized a more comprehensive approach by "shotgunning" the entire 1.8 Mb H. influenzae genome. Previously, such an approach would have failed because the software did not exist to assemble such a massive amount of information accurately. The TIGR Assembler, software developed by TIGR, was up to the task, reassembling the approximately 24,000 DNA fragments into the whole genome. After the H. influenzae genome was "shotgunned" and the clones purified sufficiently, the TIGR Assembler software required approximately 30 hours of central processing unit time on a SPARCenter 2000 containing half a gigabyte of RAM, testifying to the enormous complexity of the computation.

Venter's H. influenzae project had failed to win funding from the National Institutes of Health, indicating the serious doubts surrounding his ambitious proposal. It simply was not believed that such an approach could accurately sequence the large 1.8 Mb genome of the bacterium. Venter proved everyone wrong and succeeded in sequencing the genome in 13 months at a cost of 50 cents per base, half the cost of conventional sequencing and drastically faster. This new method of sequencing led to a multitude of sequences completed by TIGR over the ensuing years. Mycoplasma genitalium, a bacterium that is associated with reproductive-tract infections and is renowned for having the smallest genome of all free-living organisms, was sequenced by TIGR in a period of eight months between January and August of 1995, an extraordinary example of the efficiency of TIGR's new sequencing method. TIGR subsequently published the first genome sequence of a representative of the Archaea, Methanococcus jannaschii; the first genome sequence of a sulfur-metabolizing organism, Archaeoglobus fulgidus; the genome sequence of the pathogen involved in peptic ulcer disease, Helicobacter pylori; and the genome sequence of the Lyme disease spirochaete, Borrelia burgdorferi.

TIGR's dramatic leadership role in the field of genome sequencing was paralleled by the final completion in 1997 of two of the largest genomic sequences, those of the bacterium E. coli K-12 and the yeast S. cerevisiae. These projects were the culmination of over seven years of intensive work. The yeast genome was the final result of a tremendous international collaboration of more than 600 scientists from over 100 laboratories, representing the largest decentralized experiment in modern molecular biology. The final work represented the efforts of scientists from Japan, Europe, Canada, and the United States, producing the largest full-length sequence (12 Mb) ever completed. In an incredible display of organizational mastery, only 3.4% of the total sequencing effort was duplicated among laboratories. The E. coli sequence was considerably smaller (4.6 Mb) but equally important in terms of experimental utility. E. coli is the preferred model in biochemical genetics, molecular biology, and biotechnology, and its genomic characterization will undoubtedly further research toward a more complete understanding of this important experimental, medical, and industrial organism.

At the close of 1997, we are halfway through the time allotted for completing the Human Genome Project, which is projected to finish on September 30, 2005, approximately fifty years after the landmark paper of Watson and Crick. Currently, major groups have sequenced approximately 50 Mb of human DNA, representing roughly 1.5% of the 3,000 Mb genome. The estimated finish of the human genome by the year 2005 appears quite optimistic considering that the world's large-scale sequencing capacity is approximately 100 Mb per year. To complete the genome on time, the average production must increase to about 400 Mb per year. Several factors, including the slow rate of Sanger sequencing and the high accuracy goal of the HGP, which allows for only one error in 10,000 bases, limit the ability of researchers to proceed more quickly. Advancements in Sanger sequencing, or possible replacements for this time-intensive process, will be necessary to ensure the HGP's goal of completion by the year 2005.
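The production figure quoted above can be checked with simple arithmetic. The sketch below uses the numbers given in the text (3,000 Mb genome, about 50 Mb finished, 100 Mb per year of capacity, a September 30, 2005 finish) and assumes, for illustration, that "the close of 1997" means December 31, 1997. The variable and function names are invented.

```python
from datetime import date

GENOME_MB = 3000.0             # human genome size, from the text
FINISHED_MB = 50.0             # approximate amount sequenced so far, from the text
CAPACITY_MB_PER_YEAR = 100.0   # world large-scale capacity, from the text

START = date(1997, 12, 31)     # assumption: "the close of 1997"
DEADLINE = date(2005, 9, 30)   # projected finish date, from the text


def years_between(start, end):
    """Elapsed time in years, using an average year of 365.25 days."""
    return (end - start).days / 365.25


def required_rate(total_mb, done_mb, years):
    """Average yearly production needed to finish the remainder in the given time."""
    return (total_mb - done_mb) / years


years_left = years_between(START, DEADLINE)
rate_needed = required_rate(GENOME_MB, FINISHED_MB, years_left)
years_at_capacity = (GENOME_MB - FINISHED_MB) / CAPACITY_MB_PER_YEAR

if __name__ == "__main__":
    print(f"Years remaining: {years_left:.2f}")
    print(f"Required rate: {rate_needed:.0f} Mb per year")
    print(f"Years needed at current capacity: {years_at_capacity:.1f}")
```

Under these assumptions the required rate comes out a little under 400 Mb per year, in line with the figure above, while the current capacity of 100 Mb per year would need almost thirty years to cover the remaining sequence.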

As of September 1997, thirteen genome sequences of free-living organisms had been completed, including the two largest, E. coli and yeast, and eleven other microbial genomes under 4.2 Mb in length. Four other large-scale projects are in progress: the sequencing of the nematode C. elegans, which is 71% completed; the fruit fly Drosophila melanogaster, which is 6% completed; the mouse, which has less than 1% finished; and the human, which is only 1.5% completed. These statistics are impressive considering that only four years ago no completed sequences existed.

The rapid proliferation of biological information in the form of genome sequences has been the major factor in the creation of the field of bioinformatics, which focuses on the acquisition, storage, access, analysis, modeling, and distribution of the many types of information embedded in DNA sequences. This field will be challenged by the heightened demands that the growing volume of information places on the algorithms currently used for sequence manipulation. The growing sequence knowledge of the human genome has been likened to the establishment of the periodic table in the 19th century. Just as past chemists systematically organized all elements in an array that captured their differences and similarities, the Human Genome Project will allow modern scientists to construct a biological periodic table relating units of nucleotides. This periodic table will contain not 100 elements but 100,000 genes, reflecting not their similarity in electronic configuration but their evolutionary and functional relationships. Bioinformatics will be the tool of the modern scientist in interpreting this periodic table of biological information.

For any comments on the paper, email the author: Edmund Pillsbury.

Cool Genome Sequencing Sites

To see the pioneers of genomic sequencing, check out The Institute for Genomic Research and their private affiliate Human Genome Sciences. Sequences for Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Archaeoglobus fulgidus, Helicobacter pylori, and Borrelia burgdorferi can be found at the TIGR site along with links to their papers.

Non-commercial sequencing projects exceeding 1 Mb of production (listed from the largest to smallest):

1) Sanger Centre (UK), named after the guy who started it all.
2) Genome Sequencing Center at Washington University
3) University of Oklahoma
4) Baylor College of Medicine
5) Whitehead Institute
6) Institute for Molecular Biology (Jena, Germany)
7) University of Washington
8) University of Texas Southwestern Medical Center (Dallas)

Government sequence databases:

1) The National Center for Biotechnology Information (NCBI): a great resource that includes GenBank, the federal sequence repository where everyone submits sequences.
2) Genome Sequence Database (GSDB) at the National Center for Genome Resources (Santa Fe, New Mexico).
3) The Genome Data Base (GDB): a worldwide repository for mapping information.

Yeast Databases:

1) Munich Information Center for Protein Sequences (MIPS)
2) Yeast Protein Database (YPD)
3) Saccharomyces Genome Database (SGD) at Stanford University
4) SWISS-PROT, University of Geneva, Switzerland
5) GeneQuiz, European Molecular Biology Laboratory, Heidelberg, Germany
6) NIH Yeast Information Page
7) Schizosaccharomyces pombe
8) Candida albicans
9) XREFdb, National Center for Biological Information, Baltimore, MD

Sequencing projects in progress:

1) Drosophila melanogaster sequencing status.
2) Caenorhabditis elegans sequencing status.

List compiled by Edmund Pillsbury; please email with any problems or questions.