A History of Genome Sequencing:
The sequencing of the human genome along with related organisms
represents one of the largest scientific endeavors in the history of
mankind. The information garnered from sequencing will provide the
raw data for the exploding field of bioinformatics, where computer
science and biology live in symbiotic harmony. The large-scale
sequencing proposed by the Human Genome Project in 1990 could never
have been a reality without modern computer facilities. Merely
twenty years ago, computers would have been powerless in the face of such
a daunting amount and array of data. Homologue identification and
genome characterization between organisms comprising millions of
nucleotides were unimaginable until the rapid advancement of
microchips and processors over the past two decades. In addition, the
first sequenced genome of a free-living organism, Haemophilus
influenzae, would have been impossible without the computational
methods developed at the facilities of The Institute for Genomic
Research (TIGR). While this is not technically an aspect of
bioinformatics, this development could not have been achieved with the
computers of yesterday. The short history of genome sequencing began
with Frederick Sanger's invention of DNA sequencing almost twenty-five
years ago.
The art of determining the sequence of DNA is known as Sanger
sequencing after its brilliant pioneer. This technique involves the
separation of fluorescently labeled DNA fragments according to their
length by polyacrylamide gel electrophoresis (PAGE). The base at the end of each
fragment can then be visualized and identified by the dye with which
it is labeled. The time- and labor-intensive nature of gel preparation and
running, as well as the large amounts of sample required, increase
the time and cost of genomic sequencing. These conditions
drastically reduce the efficiency of sequencing projects, ultimately
limiting researchers in their sequencing attempts.
The genome of bacteriophage φX174, a viral genome of only 5,386 base
pairs (bp), was the first to be sequenced. Frederick Sanger, in another
revolutionary discovery, invented the method of "shotgun" sequencing,
a strategy based on breaking the target genome into random pieces of
DNA that are cloned and sequenced individually. The sequenced
fragments are then assembled by
their overlapping regions to form contiguous sequences (otherwise
known as contigs). The final step involves the use of custom
primers to sequence across the gaps between the contigs, thus giving the
completely sequenced genome. Sanger first used "shotgun" sequencing
five years later to complete the bacteriophage λ sequence, which was
significantly larger at 48,502 bp. This method allowed sequencing
projects to proceed at a much faster rate, thus expanding the scope of
realistic sequencing ventures. Since then, several other viral and
organellar genomes have been sequenced using similar techniques, such
as the 229 kb genome of cytomegalovirus (CMV), the 192 kb genome of
vaccinia, the 187 kb mitochondrial and 121 kb chloroplast
genomes of Marchantia polymorpha, and the 186 kb genome of
smallpox.
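The assembly step described above, joining fragments by their overlapping ends into contigs, can be illustrated with a small sketch. This is a toy greedy merge written for this essay, not the method used by Sanger or by any real assembler; the function names and the minimum overlap of 3 bases are assumptions chosen for illustration, and real reads contain errors and repeats that this sketch ignores.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap.

    Returns the contigs that remain. More than one contig means that
    gaps are left, which would need targeted sequencing to close.
    """
    # Drop duplicates and fragments wholly contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    contigs = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_len:
                        best_len, best_i, best_j = size, i, j
        if best_len == 0:
            break  # no overlaps left: remaining contigs are separated by gaps
        # Join the pair, keeping the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs
```

Calling greedy_assemble(["ATGCGTACG", "TACGTTAGCC", "AGCCGATAGG", "ATAGGCTA"]) rebuilds the single 24-base sequence the four fragments were cut from. If one fragment shares no overlap with the others, two contigs are returned, which is the situation the custom primers are meant to resolve.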
The success with viral genome sequencing stemmed from the relatively
small size of their genomes. In 1989, André Goffeau set up a
European consortium to sequence the genome of the budding yeast
Saccharomyces cerevisiae (12.5 Mb). Goffeau's European
collaboration involved 74 different laboratories drawn to the project
in hopes of sequencing the homologs of their favorite genes. Most
laboratories utilized Sanger's "shotgun" method of sequencing, which
had become the accepted standard for genome sequencing. The S.
cerevisiae genome was approximately 60 times larger than any
sequence previously attempted, which explains why Goffeau felt compelled
to invite the cooperation of a group of laboratories. At the time, the
sequencing of model organisms such as S. cerevisiae appeared
to be the logical step towards the eventual characterization of the
human genome, a task that seemed beyond the scope of technology due
to its tremendous size of 3,000 Mb. Sequencing smaller genomes would
highlight the problems with sequencing techniques, eventually refining
the technology to be used on large-scale projects like H.
sapiens. In addition, valuable insight concerning these organisms
would be gained with the elucidation of their genetic makeup.
The following year saw the initiation of a plethora of ambitious
sequencing proposals, the foremost being the introduction of the Human
Genome Project in 1990. The U.S. Human Genome Project (HGP) is a
joint effort of the Department of Energy and the National Institutes
of Health that was designed as a three-step program to produce
genetic maps, physical maps, and finally the complete nucleotide
sequence map of the human chromosomes. The first two aims of the
project are practically fulfilled, and the majority of work is now
concentrated on the exact nucleotide sequence of the human genome. In the
wake of this announcement came the start of three projects aimed at
elucidating the sequences of smaller model organisms similar to
S. cerevisiae in their academic utility:
Escherichia coli, Mycoplasma capricolum, and
Caenorhabditis elegans. It was hoped that these projects
would increase the efficiency of sequencing, but unfortunately they
fell short of this task. Many anticipated that E. coli would
be the first genome to be sequenced entirely, but to the shock of the
scientific community, an outsider won the race for the first complete
genome sequence of a free-living organism, Haemophilus
influenzae.
A team headed by J. Craig Venter of The Institute for Genomic
Research (TIGR) and Nobel laureate Hamilton Smith of Johns Hopkins
University sequenced the 1.8 Mb bacterium with new computational
methods developed at TIGR's facility in Gaithersburg, Maryland.
Previous sequencing projects had been limited by the lack of adequate
computational approaches to assemble the large number of random
sequences produced by "shotgun" sequencing. In conventional
sequencing, the genome is broken down laboriously into ordered,
overlapping segments, each containing up to 40 kb of DNA. These
segments are "shotgunned" into smaller pieces and then sequenced to
reconstruct the genome. Venter's team utilized a more comprehensive
approach by "shotgunning" the entire 1.8 Mb H. influenzae
genome. Previously, such an approach would have failed because the
software did not exist to assemble such a massive amount of
information accurately. Software developed by TIGR, called the TIGR
Assembler, was up to the task, reassembling the approximately 24,000
DNA fragments into the whole genome. After the H. influenzae
genome was "shotgunned" and the clones purified sufficiently, the
TIGR Assembler software required approximately 30 hours of central
processing unit time on a SPARCenter 2000 containing half a gigabyte
of RAM, testifying to the enormous complexity of the computation.
Venter's H. influenzae project had failed to win funding from
the National Institutes of Health, indicating the serious doubts
surrounding his ambitious proposal. It simply was not believed that
such an approach could accurately sequence the large 1.8 Mb genome of the
bacterium. Venter proved everyone wrong and succeeded in
sequencing the genome in 13 months at a cost of 50 cents per base,
which was half the cost of, and drastically faster than, conventional
sequencing. This new method of sequencing led to a multitude of
completed sequences over the ensuing years by TIGR. Mycoplasma
genitalium, a bacterium that is associated with
reproductive-tract infections and is renowned for having the smallest
genome of all free-living organisms, was sequenced by TIGR in a period
of eight months between January and August of 1995, an extraordinary
example of the efficiency of TIGR's new sequencing method. TIGR
subsequently published the first genome sequence of a representative
of the Archaea, Methanococcus jannaschii; the first
genome sequence of a sulfur-metabolizing organism, Archaeoglobus
fulgidus; the genome sequence of the pathogen involved in peptic
ulcer disease, Helicobacter pylori; and the genome sequence
of the Lyme disease spirochaete, Borrelia burgdorferi.
TIGR's dramatic leadership role in the field of genome sequencing was
paralleled by the final completion of two of the largest genomic
sequences, the bacterium E. coli K-12 and the yeast S.
cerevisiae, in 1997. These projects were the culmination of over
seven years of intensive work. The yeast genome was the final result
of a tremendous international collaboration of more than 600
scientists from over 100 laboratories, representing the largest
decentralised experiment in modern molecular biology. The final work
represented the efforts of scientists from Japan, Europe, Canada, and the
United States, producing the largest full-length sequence (12 Mb) ever
completed. In an incredible display of organizational mastery, only 3.4% of
the total sequencing efforts were duplicated among laboratories. The
E. coli sequence was considerably smaller (4.6 Mb) but equally
important in terms of experimental utility. E. coli is the
preferred model in biochemical genetics, molecular biology, and
biotechnology, and its genomic characterization will undoubtedly
further research toward a more complete understanding of this
important experimental, medical, and industrial organism.
At the close of 1997, we are halfway through the time allotted for
completing the Human Genome Project, which is projected to finish on September
30, 2005, approximately fifty years after the landmark paper of Watson
and Crick. Currently, major groups have sequenced approximately 50 Mb
of human DNA, representing less than 2% of the 3,000 Mb genome. The
estimated finish of the human genome by the year 2000 appears quite
optimistic considering that the world's large-scale sequencing
capacity is approximately 100 Mb per year. To complete the genome on
schedule, the average production must increase to 400 Mb per year. Several factors,
including the slow rate of Sanger sequencing and the high accuracy
goal of the HGP, which allows for one error in 10,000 bases, limit the
ability of researchers to proceed more quickly. Advancements in
Sanger sequencing or possible replacements for this time-intensive
process will be necessary to ensure the HGP's goal of completion by
the year 2005.
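The throughput argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (3,000 Mb genome, about 50 Mb finished, about 100 Mb per year of capacity). The 7.75 years remaining is an assumption that counts from the close of 1997 to September 30, 2005, and the function names are illustrative.

```python
GENOME_MB = 3000.0             # human genome size quoted in the text
SEQUENCED_MB = 50.0            # human sequence finished so far, per the text
CAPACITY_MB_PER_YEAR = 100.0   # worldwide large-scale capacity, per the text
YEARS_REMAINING = 7.75         # assumption: end of 1997 to September 30, 2005


def years_to_finish(remaining_mb: float, rate_mb_per_year: float) -> float:
    """Years needed to sequence `remaining_mb` at a constant yearly rate."""
    return remaining_mb / rate_mb_per_year


def required_rate(remaining_mb: float, years: float) -> float:
    """Average yearly output needed to finish `remaining_mb` in `years`."""
    return remaining_mb / years


remaining_mb = GENOME_MB - SEQUENCED_MB
print(f"Remaining: {remaining_mb:.0f} Mb")
print(f"At current capacity: {years_to_finish(remaining_mb, CAPACITY_MB_PER_YEAR):.1f} years")
print(f"Rate needed for 2005: {required_rate(remaining_mb, YEARS_REMAINING):.0f} Mb per year")
```

Under these assumptions the remaining 2,950 Mb would take about 29.5 years at the current rate, and finishing by 2005 calls for roughly 380 Mb per year, in line with the figure of 400 Mb per year given above.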
As of September of 1997, thirteen genome sequences of free-living
organisms had been completed, including the two largest, E.
coli and yeast, and eleven other microbial genomes under the
length of 4.2 Mb. Four other large-scale projects are in progress,
including the sequencing of the nematode C. elegans, which is
71% completed; the fruit fly Drosophila melanogaster, which is
6% completed; the mouse, which has less than 1% finished; and the
human, which is only 1.5% completed. These statistics are impressive
considering that only four years ago no completed sequences of
free-living organisms existed.
The rapid proliferation of biological information in the form of
genome sequences has been the major factor in the creation of the
field of bioinformatics, which focuses on the acquisition, storage,
access, analysis, modeling, and distribution of the many types of
information embedded in DNA sequences. This field will be challenged
by the heightening demands of increased information on the algorithms
currently utilized for sequence manipulation. The growing sequence
knowledge of the human genome has been likened to the establishment
of the periodic table in the 19th century. Just as past chemists
systematically organized all elements in an array that captured their
differences and similarities, the Human Genome Project will allow
modern scientists to construct a biological periodic table relating
units of nucleotides. This periodic table will contain not 100
elements but 100,000 genes, reflecting not their similarity in
electronic configuration but their evolutionary and functional
relationships. Bioinformatics will be the tool of the modern scientist
in interpreting this periodic table of biological information.
For any comments on the paper email the author:
Edmund Pillsbury.