Rita V.M. Rio
MB&B 752a Final project
December 15, 2000
Bioinformatic applications in infectious disease:
revealing the molecular secrets of microbial pathogens and parasites
Advances in DNA sequencing technology have resulted in the high-throughput production of an
enormous quantity of molecular sequence information. Since the publication of the
Haemophilus influenzae genome (Fleischmann et al., 1995), complete sequences for 27 more
microbial genomes and 3 lower eukaryotic chromosomes have been published, and many more
genome projects are underway (Fraser et al., 2000). A current list of complete and ongoing
projects is located at http://216.190.101.28/GOLD/. Bioinformatics provides the
computational tools needed for the analysis, management, and organization of the vast amount
of available sequence information, enabling critical and effective use of the data.
Bioinformatic applications in infectious disease research can lead to a better understanding
of specific mechanisms of pathogenesis and host adaptation. Furthermore, bioinformatics can
catalyze progress in the battle against infectious diseases caused by microbes and parasites
by facilitating the identification of candidate virulence genes, vaccines, and
chemotherapeutic agents.
Upon the completion of a genome sequence, the next major task is the identification and
characterization of protein-coding regions. Gene identification programs use a range of
mathematical techniques for prediction, such as neural nets, Markov chain analysis, hidden
Markov models, dynamic programming, and linguistic analysis (Bhattacharya et al., 2000).
Presently, there are numerous gene-finding methods, with particular methods being better
suited to specific organisms or groups of organisms (e.g. prokaryotes, lower eukaryotes,
higher eukaryotes, mammals). The GeneMark algorithm has been used to discover genes in the
completed genome projects of Mycoplasma genitalium and H. influenzae
(http://genemark.biology.gatech.edu/GeneMark). Markov models of various orders can be used
to calculate the probabilities of oligonucleotides, taking into account correlations between
nucleotide frequencies at different positions of the sequence. Because the correlation
between nucleotides differs in coding and noncoding sequences, the corresponding Markov
models also differ (Bhattacharya et al., 2000). The GeneMark algorithm accumulates a
detectable signal within a rather long bacterial gene even when a relatively weak zero-order
Markov model is used. For shorter DNA sequences, however, higher-order models are known to
be more accurate in detecting coding potential (Borodovsky et al., 1999). The GLIMMER
program is being used for the analysis of Plasmodium genomes
(http://www.tigr.org/~salzberg/glimmer-nar.pdf). The GLIMMER algorithm uses interpolated
Markov models to compute the a posteriori probability of coding function (Salzberg et al.,
1998). The TESTCODE program, which identifies coding regions based on compositional
differences between coding and noncoding regions, is being used for the examination of the
Leishmania genomes.
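The idea of scoring coding potential with Markov models can be illustrated with a minimal sketch. It is not the GeneMark or GLIMMER implementation: it uses a single fixed-order, homogeneous Markov chain for each class and ignores codon position, which real gene finders take into account. The function names, the pseudocount, and the uniform fallback for unseen contexts are assumptions made for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    for context, nexts in counts.items():
        total = sum(nexts.values()) + pseudocount * len(ALPHABET)
        model[context] = {b: (nexts.get(b, 0.0) + pseudocount) / total
                          for b in ALPHABET}
    return model

def log_likelihood(seq, model, order):
    """Log-probability of seq under the model (the first `order` bases are skipped)."""
    uniform = 1.0 / len(ALPHABET)
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        # Assumption: contexts or symbols never seen in training fall back to uniform.
        p = probs.get(seq[i], uniform) if probs else uniform
        total += math.log(p)
    return total

def coding_log_odds(seq, coding_model, noncoding_model, order):
    """Positive values favour the coding model, negative the noncoding model."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))

# Toy training sets (hypothetical sequences, not real genes).
ORDER = 1
coding = train_markov(["ATGGCCGCGGCGCTGGCCGGCGCG"], ORDER)
noncoding = train_markov(["TATATAAATTTTATAATATTTAAT"], ORDER)
print(coding_log_odds("GCGGCGGCC", coding, noncoding, ORDER))
print(coding_log_odds("TATATTTA", coding, noncoding, ORDER))
```

Raising `ORDER` captures longer-range correlations but needs more training data per context, which is the trade-off that interpolated models are designed to manage.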
The identification of untranslated regulatory regions is also important in sequence data
analysis. New virulence factors are often discovered through their coregulation with known
virulence factors (Taylor et al., 1987). Motifs associated with regulator binding sites can
be identified in the regulatory regions of genes involved in pathogenesis. These signature
patterns can subsequently be used to search for other regions containing the same patterns.
In addition, the LIGPLOT and NUCPLOT programs (http://www.biochem.ucl.ac.uk/bsm/pdbsum) can
identify protein/ligand and protein/nucleic acid interactions, respectively.
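Searching upstream regions for a signature pattern can be sketched as follows. This is a simple consensus search using the standard IUPAC nucleotide ambiguity codes, forward strand only. The consensus string, gene names, and sequences are hypothetical, and real binding-site searches often use weight matrices rather than exact consensus matching.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile("(?=(" + body + "))")

def find_motif(consensus, upstream_regions):
    """Return (gene, 0-based start, matched site) for every forward-strand hit."""
    pattern = consensus_to_regex(consensus)
    hits = []
    for gene, seq in upstream_regions.items():
        for m in pattern.finditer(seq.upper()):
            hits.append((gene, m.start(), m.group(1)))
    return hits

# Hypothetical upstream regions and a hypothetical regulator consensus.
upstream = {"geneA": "CCTGTGATTTGTCA", "geneB": "GGGCCCGGG"}
for hit in find_motif("TGTNA", upstream):
    print(hit)
```

Genes whose upstream regions share a site with known virulence genes become candidates for coregulation, which would then need experimental confirmation.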
A comparative analysis approach can be employed to relate the genome sequences of various
pathogenic species or strains. Comparing nucleotide or amino acid sequences against
databases (e.g. GenBank, EMBL, DDBJ, PDB, Swiss-Prot) with one of a number of alignment
programs (e.g. FASTA, BLAST) identifies matches across entire databases of known genes
and/or proteins. For example, the identification of genes in Treponema pallidum whose
products show homology to haemolysins or cytoskeleton-interacting proteins assists our
understanding of the pathogenesis of syphilis (Frosch et al., 1998). Recently, the Institute
for Genomic Research created the Comprehensive Microbial Resource (CMR) database to
facilitate comparative genomics studies on completed genome sequences. CMR
(http://www.tigr.org) includes information on sequence and annotation (e.g. taxon, Gram
stain pattern), the structure and composition of DNA molecules (e.g. plasmid vs.
chromosomal, GC content), and many attributes of protein sequences (e.g. pI, molecular
weight) (Fraser et al., 2000).
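The principle behind database similarity searching can be sketched with a small local alignment scorer. FASTA and BLAST are fast heuristics; the sketch below instead uses the exact Smith-Waterman dynamic programming recurrence with a linear gap penalty and a flat match/mismatch score, which is a deliberate simplification (real protein searches use substitution matrices). The scoring values, function names, and toy database are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def best_database_hit(query, database):
    """Return (name, score) of the highest-scoring database entry."""
    scored = ((name, smith_waterman_score(query, seq))
              for name, seq in database.items())
    return max(scored, key=lambda item: item[1])

# Hypothetical toy database; real searches run against GenBank, Swiss-Prot, etc.
database = {"entry1": "GGACGTGG", "entry2": "CCCCCC"}
print(best_database_hit("TTACGTTT", database))
```

Only the score is computed here; recovering the aligned residues would require keeping the full matrix and tracing back from the best cell.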
Although the whole-gene comparison approach is useful for recognizing good candidates among
genes whose functions have been described, it is less useful for discovering new virulence
functions, particularly in less well-studied pathogens. With these pathogens, databases that
search for motifs rather than for matches to whole genes or proteins can prove helpful.
BLOCKS, ProDom, and PROSITE are databases consisting of motif collections. A motif is
stringent enough to retrieve the family members in a complete protein database. Thus, hits
to these databases are based on the motif regions and do not require extensive similarity
elsewhere in the sequence, as may be the case with whole-gene matches (Weinstock, 2000).
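How a motif hit differs from a whole-gene match can be seen by scanning a protein with a PROSITE-style pattern. The sketch below converts the basic PROSITE pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" as terminal anchors) into a regular expression. It covers only these basic elements, and the protein sequences are made up for illustration.

```python
import re

ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a compiled regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = m.groups()
        if residue == "x":
            core = "."
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"
        else:
            core = residue
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return re.compile("".join(parts))

# N-glycosylation site pattern in PROSITE syntax, scanned against a made-up sequence.
glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
match = glyco.search("AANGSAAA")
print(glyco.pattern, match.start(), match.group())
```

Because the pattern only constrains a handful of positions, a hit is possible even when the rest of the protein has no detectable similarity to any characterized sequence.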
Of particular interest in pathogenesis studies are proteins involved in host interactions,
which are the most likely virulence factors. The majority of these proteins are localized to
the cell surface or secreted (Weinstock, 2000). Transmembrane regions can be identified with
various programs such as PHD and MEMSAT. Other membrane proteins, such as those involved in
cell adhesion and in the generation of antigenic variation, are important for pathogen
survival. In addition, transport capacity governs metabolic potential and dictates the range
of tissues that a pathogen can inhabit (Fraser et al., 2000). Thus, pathogen proteins
involved in transport and cell adhesion that have no homolog in the host are excellent
targets for the development of antipathogenic agents and other interventional strategies.
Protein structural information can also be utilized in pharmaceutical design. Protein
structure classification and fold information can be obtained with programs such as CATH,
CATHWheels, and SAS (http://www.biochem.ucl.ac.uk/bsm). These programs can supplement
protein sequence results with structural features. Perhaps a fold unique to a pathogen can
be identified, and inhibitory drugs acting specifically at this fold can be synthesized.
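As a rough illustration of how candidate transmembrane regions can be flagged from sequence alone, the sketch below computes a sliding-window hydropathy profile on the Kyte-Doolittle scale. This is a much simpler classical heuristic than the methods used by PHD or MEMSAT and is not a description of either program. The window of 19 residues and the threshold of 1.6 are commonly quoted defaults, treated here as assumptions, and the test sequence is artificial.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full-length window along the protein."""
    # Assumption: non-standard residue codes (e.g. X) are scored as 0.0.
    scores = [KD.get(aa, 0.0) for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_segments(protein, window=19, threshold=1.6):
    """Merge consecutive high-hydropathy windows into half-open (start, end) spans."""
    segments = []
    start = end = None
    for i, value in enumerate(hydropathy_profile(protein, window)):
        if value >= threshold:
            if start is None:
                start = i
            end = i + window
        elif start is not None:
            segments.append((start, end))
            start = None
    if start is not None:
        segments.append((start, end))
    return segments

# Artificial protein: a 20-residue hydrophobic stretch flanked by charged residues.
toy = "D" * 10 + "L" * 20 + "D" * 10
print(candidate_tm_segments(toy))
```

The reported spans are deliberately generous, since any window whose average clears the threshold is merged in; they mark regions worth examining with a dedicated predictor rather than precise helix boundaries.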
The uses of the many bioinformatic computational tools available include the identification
of protein-coding regions and untranslated regulatory regions, comparative genome analysis,
protein structure classification, and the analysis of protein/protein and protein/nucleic
acid interactions. All of these data can be integrated to give greater biological
significance to sequencing results, providing further direction for research and
experiments. A pathogen genome sequence is the “parts list” that can be exploited for
intervention purposes. Bioinformatics facilitates the search for gene products useful for
vaccine and drug design to combat microbial and parasitic diseases.
Bhattacharya, A., S. Bhattacharya, A. Joshi, S. Ramachandran, and R. Ramaswamy. 2000. Identification of parasitic genes by computational methods. Parasitology Today 16(3): 127-131.
Fleischmann, R.D., M.D. Adams, O. White, R.A. Clayton, E.F. Kirkness, A.R. Kerlavage, C.J. Bult, J.-F. Tomb, B.A. Dougherty, J.M. Merrick, K. McKenney, G. Sutton, W. Fitzhugh, C. Fields, J.D. Gocayne, J. Scott, R. Shirley, L.-I. Liu, A. Glodek, J.M. Kelley, J.F. Weidman, C.A. Phillips, T. Spriggs, E. Hedblom, M.D. Cotton, T.R. Utterback, M.C. Hanna, D.T. Nguyen, D.M. Saudek, R.C. Brandon, L.D. Fine, J.L. Fritchman, J.L. Fuhrmann, N.S.M. Geoghagen, C.L. Gnehm, L.A. McDonald, K.V. Small, C.M. Fraser, H.O. Smith, and J.C. Venter. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae. Science 269: 496-512.
Salzberg, S.L., A.L. Delcher, S. Kasif, and O. White. 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26: 544-548.
Taylor, R.K., V.L. Miller, D.B. Furlong, and J.J. Mekalanos. 1987. Use of phoA gene fusions to identify a pilus colonization factor coordinately regulated with cholera toxin. PNAS 84: 2833-2837.
Weinstock, G.M. 2000. Genomics and bacterial pathogenesis. Emerging Infectious Diseases 6(5): 496-504.