Rita V.M. Rio                                                                           MB&B 752a Final project

December 15, 2000

 

Bioinformatic applications in infectious disease: revealing the molecular secrets of microbial pathogens and parasites

 

 

            Advances in DNA sequencing technology have made possible the high-throughput generation of an enormous quantity of molecular sequence information.  Since the publication of the Haemophilus influenzae genome (Fleischmann, R.D. et al.; 1995), complete sequences for 27 more microbial genomes and 3 lower eukaryotic chromosomes have been published, and many more genome projects are underway (Fraser et al.; 2000).  A current list of complete and ongoing projects is available at http://216.190.101.28/GOLD/.  Bioinformatics provides the computational tools needed to analyze, manage, and organize this vast amount of sequence information so that the data can be used critically and effectively.  Bioinformatic applications in infectious disease research can lead to a better understanding of specific mechanisms of pathogenesis and host adaptation.  Furthermore, bioinformatics can catalyze progress in the battle against infectious diseases caused by microbes and parasites by facilitating the identification of candidate virulence genes, vaccines, and chemotherapeutic agents.

            Upon completion of a genome sequence, the next major task is the identification and characterization of protein coding regions.  Gene identification programs use a range of mathematical techniques for prediction, such as neural networks, Markov chain analysis, hidden Markov models, dynamic programming, and linguistic analysis (Bhattacharya, A. et al.; 2000).  Numerous gene-finding methods now exist, with particular methods better suited to specific organisms or groups of organisms (e.g. prokaryotes, lower eukaryotes, higher eukaryotes, mammals, etc.).  The GeneMark algorithm has been used to discover genes in the completed genome projects of Mycoplasma genitalium and H. influenzae (http://genemark.biology.gatech.edu/GeneMark).  Markov models of various orders can be used to calculate the probabilities of oligonucleotides, taking into account correlations between nucleotide frequencies at different positions of the sequence.  Because correlations between nucleotides differ in coding and noncoding sequences, the corresponding Markov models also differ (Bhattacharya, A. et al.; 2000).  The GeneMark algorithm accumulates a detectable signal over the length of a rather long bacterial gene even when a relatively weak zero-order Markov model is used.  For shorter DNA sequences, however, higher-order models are known to detect coding potential more accurately (Borodovsky et al.; 1999).  The GLIMMER program is being used for the analysis of Plasmodium genomes (http://www.tigr.org/~salzberg/glimmer-nar.pdf).  The GLIMMER algorithm uses interpolated Markov models to compute the a posteriori probability of coding function (Salzberg et al.; 1998).  The TESTCODE program, which identifies coding regions based on compositional differences between coding and noncoding regions, is being used for the examination of the Leishmania genomes.
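
            The Markov-model idea can be illustrated with a minimal Python sketch.  It is not the GeneMark or GLIMMER implementation: it trains one fixed-order model on known coding sequences and one on noncoding sequences, then scores a new sequence by the log-odds ratio of the two.  The function names, the add-one smoothing, and the toy training sequences are assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, order):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base).  Add-one smoothing (an
    illustrative choice) keeps unseen transitions from having zero probability.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values())
        return (c.get(base, 0) + 1) / (total + 4)  # 4 possible nucleotides

    return prob


def log_odds(seq, coding, noncoding, order):
    """Sum of log(P_coding / P_noncoding) over every position of seq.

    A positive score means the coding model explains the sequence better.
    """
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, base) / noncoding(ctx, base))
    return score


# Toy training data (hypothetical, far too small for real use).
ORDER = 1
coding_model = train_markov(["ATGGCCGCCGCCGCC"], ORDER)
noncoding_model = train_markov(["ATATATATATATAT"], ORDER)

gc_rich_score = log_odds("GCCGCC", coding_model, noncoding_model, ORDER)
at_rich_score = log_odds("ATATAT", coding_model, noncoding_model, ORDER)
```

            Setting ORDER to 0 gives the zero-order model mentioned above, in which each base is scored independently of its neighbors.  Real gene finders also model the three codon positions separately, which this sketch omits.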

            The identification of untranslated regulatory regions is also important in sequence data analysis.  New virulence factors are often discovered through their coregulation with known virulence factors (Taylor et al.; 1987).  Motifs associated with regulator binding sites can be identified in the regulatory regions of genes involved in pathogenesis.  These signature patterns can subsequently be used to search for other regions containing the same patterns.  In addition, the LIGPLOT and NUCPLOT programs (http://www.biochem.ucl.ac.uk/bsm/pdbsum) can identify protein/ligand and protein/nucleic acid interactions, respectively.
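
            Searching regulatory regions for a signature pattern can be sketched as follows.  The example expands a consensus written in IUPAC nucleotide codes into a regular expression and reports where it occurs in a set of upstream sequences.  The gene names, the sequences, and the consensus are hypothetical placeholders, not published binding sites.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[c] for c in consensus.upper())


def find_sites(consensus, upstream_regions):
    """Return {gene: [start positions]} for genes whose region has a match.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    hits = {}
    for gene, seq in upstream_regions.items():
        positions = [m.start() for m in pattern.finditer(seq.upper())]
        if positions:
            hits[gene] = positions
    return hits


# Hypothetical upstream regions keyed by hypothetical gene names.
regions = {
    "geneA": "CCTTGACATTTATAATGG",
    "geneB": "GGGGCCCCGGGG",
}
sites = find_sites("TATWAT", regions)
```

            Genes that share a site with a known virulence gene would then be candidates for coregulation, to be confirmed experimentally.  Exact consensus matching is the simplest possible approach; position weight matrices tolerate more variation.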

            A comparative analysis approach can be employed to relate the genome sequences of various pathogenic species or strains.  Comparing nucleotide or amino acid sequences against databases (e.g. GenBank, EMBL, DDBJ, PDB, Swiss-Prot) with one of a number of alignment programs (e.g. Fasta, BLAST) identifies matches across entire databases of known genes and/or proteins.  For example, the identification of genes in Treponema pallidum whose products exhibit homology to haemolysins or cytoskeleton-interacting proteins assists our understanding of the pathogenesis of syphilis (Frosch et al.; 1998).  Recently, the Institute for Genomic Research has created the Comprehensive Microbial Resource (CMR) database to facilitate comparative genomics studies on completed genome sequences.  CMR (http://www.tigr.org) includes information on sequence and annotation (e.g. taxon, Gram stain pattern, etc.), structure and composition of DNA molecules (e.g. plasmid vs. chromosomal, GC content, etc.), and many attributes of protein sequences (e.g. pI, molecular weight, etc.) (Fraser et al.; 2000).
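
            The database comparison described above rests on scoring local alignments.  The sketch below computes a Smith-Waterman local alignment score by dynamic programming and uses it to pick the best-matching entry from a small in-memory collection.  Fasta and BLAST are faster heuristics that approximate this kind of search; the scoring values, the linear gap penalty, and the toy database here are assumptions for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    H[i][j] holds the best score of any local alignment ending at
    a[i-1] and b[j-1]; scores are floored at zero so that an alignment
    can start anywhere in either sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def best_hit(query, database):
    """Return the name of the database entry with the highest score."""
    return max(database, key=lambda name: smith_waterman(query, database[name]))


# Hypothetical database entries.
toy_db = {
    "entry1": "CCGATTACACC",
    "entry2": "TTTTTTTT",
}
top = best_hit("GATTACA", toy_db)
```

            Protein searches would replace the simple match/mismatch values with a substitution matrix, and real tools attach a statistical significance to each hit rather than reporting a raw score.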

            Although the whole-gene comparison approach is useful for recognizing good candidates among genes whose functions have been described, it is not particularly useful for discovering new virulence functions, especially in less well-studied pathogens.  For these pathogens, databases that are searched with motifs rather than with whole genes or proteins can prove helpful.  BLOCKS, ProDom, and PROSITE are databases consisting of motif collections.  A motif is stringent enough to retrieve the members of a protein family from a complete protein database.  Thus, hits to these databases are based on the motif regions alone and do not require extensive similarity elsewhere in the sequence, as may be the case with whole-gene matches (Weinstock; 2000).
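
            A motif search of this kind can be sketched by translating a PROSITE-style pattern into a regular expression.  The converter below handles only the common syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > as terminus anchors); it is a simplified illustration, not the PROSITE scanning software, and the short test peptides are made up.

```python
import re


def prosite_to_regex(pattern):
    """Compile a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex.

    Simplified sketch: anchors inside square brackets and other rare
    syntax are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # excluded residues
            core = "[^" + core[1:-1] + "]"
        # Bracketed sets and single residues are already valid regex.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return re.compile("".join(parts))


# Two widely used PROSITE-style patterns: an N-glycosylation site and
# the ATP/GTP-binding P-loop.
glyco = prosite_to_regex("N-{P}-[ST]-{P}")
ploop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")

glyco_hit = glyco.search("AANGSAA")      # made-up peptide
ploop_hit = ploop.search("MGAGGVGKSAL")  # made-up peptide
```

            Because only the short motif has to match, a protein can be flagged even when the rest of its sequence resembles nothing in the database.  Short patterns also match by chance, so hits are candidates rather than confirmed functions.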

            Of particular interest in pathogenesis studies are proteins involved in host interactions, which are the most likely virulence factors.  The majority of these proteins are localized to the cell surface or secreted (Weinstock; 2000).  Transmembrane regions can be identified with various programs such as PHD and MEMSAT.  Other membrane proteins, such as those involved in cell adhesion and in the generation of antigenic variation, are important for pathogen survival.  In addition, transport capacity regulates metabolic potential and dictates the range of tissues that a pathogen can inhabit (Fraser et al.; 2000).  Thus, pathogen proteins involved in transport and cell adhesion that have no homolog in the host are excellent targets for the development of antipathogenic agents and other interventional strategies.  Protein structural information can also be used in pharmaceutical design.  Protein structure classification and fold information can be obtained with programs such as CATH, CATHWheels, and SAS (http://www.biochem.ucl.ac.uk/bsm).  These programs can supplement protein sequence results with structural features.  A fold unique to a pathogen might be identified, and inhibitory drugs acting specifically at this fold could then be synthesized.
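
            As a rough illustration of how candidate transmembrane regions can be flagged from sequence alone, the sketch below computes a sliding-window hydropathy profile using the Kyte-Doolittle scale.  This is a deliberately simpler technique than the methods used by PHD and MEMSAT, and the 19-residue window and 1.6 threshold are conventional rule-of-thumb values assumed here, not parameters taken from those programs.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Return start positions of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


# Made-up protein: a hydrophobic stretch flanked by charged residues.
toy_protein = "K" * 10 + "L" * 20 + "K" * 10
tm_starts = candidate_tm_windows(toy_protein)
```

            Runs of consecutive start positions mark one putative membrane-spanning segment.  Proteins flagged this way would still need to be checked against the host proteome before being considered as intervention targets.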

            The uses of the many bioinformatic tools available include, among others, the identification of protein coding regions and untranslated regulatory regions, genomic comparison analysis, protein structure classification, and the analysis of protein/protein and protein/nucleic acid interactions.  All of these data can be integrated to give greater biological significance to sequencing results and to provide further direction for research and experiments.  A pathogen genome sequence is the “parts list” that can be exploited for intervention purposes.  Bioinformatics facilitates the search for gene products useful in vaccine and drug design to combat microbial and parasitic diseases.

 

Literature cited

 

Bhattacharya, A., S. Bhattacharya, A. Joshi, S. Ramachandran, and R. Ramaswamy.  2000.  Identification of parasitic genes by computational methods.  Parasitology Today 16(3): 127-131.

Borodovsky, M., W.S. Hayes, and A.V. Lukashin.  1999.  Statistical predictions of coding regions in prokaryotic genomes by using inhomogeneous Markov models.  In Organization of the prokaryotic genome.  R.L. Charlebois (Ed.).  ASM Press, Washington D.C. pp. 11-33.

Fleischmann, R.D., M.D. Adams, O. White, R.A. Clayton, E.F. Kirkness, A.R. Kerlavage, C.J. Bult, J.-F. Tomb, B.A. Dougherty, J.M. Merrick, K. McKenney, G. Sutton, W. Fitzhugh, C. Fields, J.D. Gocayne, J. Scott, R. Shirley, L.-I. Liu, A. Glodek, J.M. Kelley, J.F. Weidman, C.A. Phillips, T. Spriggs, E. Hedblom, M.D. Cotton, T.R. Utterback, M.C. Hanna, D.T. Nguyen, D.M. Saudek, R.C. Brandon, L.D. Fine, J.L. Fritchman, J.L. Fuhrmann, N.S.M. Geoghagen, C.L. Gnehm, L.A. McDonald, K.V. Small, C.M. Fraser, H.O. Smith, and J.C. Venter.  1995.  Whole-genome random sequencing and assembly of Haemophilus influenzae.  Science 269: 496-512.

Fraser, C.M., J. Eisen, R.D. Fleischmann, K.A. Ketchum, and S. Peterson.  2000.  Comparative genomics and understanding microbial biology.  Emerging Infectious Diseases 6(5): 505-512.

Frosch, M., J. Reidl, and U. Vogel.  1998.  Genomics in infectious diseases: approaching the pathogens.  Trends in Microbiology 6(9): 346-349.

Salzberg, S.L., A.L. Delcher, S. Kasif, and O. White.  1998.  Microbial gene identification using interpolated Markov models.  Nucleic Acids Research 26: 544-548.

Taylor, R.K., V.L. Miller, D.B. Furlong, and J.J. Mekalanos.  1987.  Use of phoA gene fusions to identify a pilus colonization factor coordinately regulated with cholera toxin.  PNAS 84: 2833-2837.

Weinstock, G.M.  2000.  Genomics and bacterial pathogenesis.  Emerging Infectious Diseases 6(5): 496-504.