BIOINFORMATICS
Introduction
Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
Bioinformatics
What is Bioinformatics?
(Molecular) Bio - informatics
One idea for a definition?
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale.
Bioinformatics is “MIS” for Molecular Biology Information. It is a practical discipline with many applications.
What is the Information?
Molecular Biology as an Information Science
Central Dogma of Molecular Biology
DNA
 -> RNA
  -> Protein
   -> Phenotype
    -> DNA
Molecules
Sequence, Structure, Function
Processes
Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
Genomic Sequence Information
 -> mRNA (level)
  -> Protein Sequence
   -> Protein Structure
    -> Protein Function
     -> Phenotype
Large Amounts of Information
Standardized
Statistical
Molecular Biology Information - DNA
Raw DNA Sequence
Coding or Not?
Parse into genes?
4 bases: AGCT
~1 K in a gene, ~2 M in genome
Molecular Biology Information: Protein Sequence
20 letter alphabet
ACDEFGHIKLMNPQRSTVWY, but not BJOUXZ
Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
~200 K known protein sequences
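As a small illustration of the 20-letter alphabet, the sketch below checks a string against it and reports any letters outside it. The function name `invalid_residues` and the example peptides are made up for illustration.

```python
# The 20 standard amino-acid one-letter codes listed on the slide.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def invalid_residues(seq):
    """Return the sorted letters in seq that are not standard amino acids."""
    return sorted(set(seq.upper()) - AMINO_ACIDS)

if __name__ == "__main__":
    print(invalid_residues("MKTAYIAKQR"))  # hypothetical peptide, all letters valid
    print(invalid_residues("MKXBZ"))       # contains letters outside the alphabet
```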
Molecular Biology Information: Macromolecular Structure
DNA/RNA/Protein
Almost all protein
(RNA adapted from D Soll web page; top right protein from M Levitt web page)
Molecular Biology Information: Protein Structure Details
Statistics on Number of XYZ triplets
200 residues/domain -> 200 CA atoms, separated by 3.8 A
Avg. residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A
=> ~1600 xyz triplets (= 8 x 200) per protein domain
10 K known domains, ~300 folds
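The back-of-the-envelope numbers above can be reproduced in a few lines. This is only a sketch of the arithmetic; the function name `domain_estimates` is hypothetical, and the inputs are the per-residue figures quoted on the slide.

```python
RESIDUES_PER_DOMAIN = 200    # from the slide
ATOMS_PER_RESIDUE = 4 + 4    # backbone + sidechain atoms for an average (Leu-like) residue
VOLUME_PER_RESIDUE = 150.0   # cubic angstroms, from the slide

def domain_estimates(n_residues=RESIDUES_PER_DOMAIN):
    """Return (xyz triplets, coordinate values, volume in cubic angstroms)."""
    triplets = n_residues * ATOMS_PER_RESIDUE
    coordinates = 3 * triplets
    volume = n_residues * VOLUME_PER_RESIDUE
    return triplets, coordinates, volume

if __name__ == "__main__":
    print(domain_estimates())
```

For a 200-residue domain this gives 8 x 200 = 1600 triplets, in line with the rough estimate quoted above.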
Molecular Biology Information: Whole Genomes
The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes
1998, worm: ~100Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes...
Gene Expression Datasets: the Transcriptome
Array Data
Other Whole-Genome Experiments
Molecular Biology Information: Other Integrative Data
Information to understand genomes
Metabolic Pathways (glycolysis), traditional biochemistry
Regulatory Networks
Whole Organisms Phylogeny, traditional zoology
Environments, Habitats, ecology
The Literature (MEDLINE)
The Future....
(Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)
Large-scale Information:
GenBank Growth
Large-scale Information:
Exponential Growth of Data Matched by Development of Computer Technology
CPU vs Disk & Net
As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial.
Driving Force in Bioinformatics
(Internet picture adapted from D Brutlag, Stanford)
Bioinformatics is born!
Weber Cartoon
Organizing
Molecular Biology Information:
Redundancy and Multiplicity
Different Sequences Have the Same Structure
Organism has many similar genes
Single Gene May Have Multiple Functions
Genes are grouped into Pathways
Genomic Sequence Redundancy due to the Genetic Code
How do we find the similarities?.....
(idea from D Brutlag, Stanford)
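One of the redundancies listed above, the degeneracy of the genetic code, is easy to make concrete. The sketch below builds the standard codon table and counts how many codons map to each amino acid, so that different DNA sequences can be seen to encode the same protein. The names `CODON_TABLE`, `translate`, and `DEGENERACY` are illustrative choices, not part of the lecture.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string in frame 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# Number of codons per amino acid (and per stop signal '*').
DEGENERACY = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(DEGENERACY.most_common())
    print(translate("CTGTTA"), translate("TTGCTC"))  # different DNA, same peptide
```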
Molecular Parts = Conserved Domains, Folds, &c
A Parts List Approach to Bike Maintenance
Vast Growth in (Structural) Data...
but number of Fundamentally New (Fold) Parts Not Increasing that Fast
World of Structures is even more Finite,
providing a valuable simplification
General Types of “Informatics” Techniques in Bioinformatics
Databases
Building, Querying
Object DB
Text String Comparison
Text Search
1D Alignment
Significance Statistics
Alta Vista, grep
Finding Patterns
AI / Machine Learning
Clustering
Datamining
Geometry
Robotics
Graphics (Surfaces, Volumes)
Comparison and 3D Matching
(Vision, recognition)
Physical Simulation
Newtonian Mechanics
Electrostatics
Numerical Algorithms
Simulation
New Paradigm for Scientific Computing
Because of the increase in data and the improvement in computers, new calculations become possible
But Bioinformatics has a new style of calculation...
Two Paradigms
Physics
Prediction based on physical principles
Exact Determination of Rocket Trajectory
Supercomputer, CPU
Biology
Classifying information and discovering unexpected relationships
globin ~ colicin ~ plastocyanin ~ repressor
networks, “federated” database
Bioinformatics Topics --
Genome Sequence
Finding Genes in Genomic DNA
introns
exons
promoters
Characterizing Repeats in Genomic DNA
Statistics
Patterns
Duplications in the Genome
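As a deliberately simplified view of gene finding, the sketch below scans the three forward reading frames for open reading frames that run from ATG to the first in-frame stop codon. It ignores introns, promoters, and the reverse strand, so it is closer to a bacterial toy model than to a real gene finder. The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) slices of ATG...stop ORFs on the forward strand.

    min_codons counts codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)

if __name__ == "__main__":
    seq = "CCATGAAATTTGGGTAACC"  # made-up sequence
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```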
Bioinformatics Topics --
Protein Sequence
Sequence Alignment
non-exact string matching, gaps
How to align two strings optimally via Dynamic Programming
Local vs Global Alignment
Suboptimal Alignment
Hashing to increase speed (BLAST, FASTA)
Amino acid substitution scoring matrices
Multiple Alignment and Consensus Patterns
How to align more than one sequence and then fuse the result in a consensus representation
Transitive Comparisons
HMMs, Profiles
Motifs
Scoring schemes and Matching statistics
How to tell if a given alignment or match is statistically significant
A P-value (or an e-value)?
Score Distributions
(extreme val. dist.)
Low Complexity Sequences
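The dynamic programming idea behind optimal alignment can be sketched compactly. The function below computes only the optimal score (no traceback), using a plain match/mismatch score and a linear gap penalty in place of a real amino acid substitution matrix; those parameter values and the name `align_score` are assumptions for illustration. Setting `local=True` switches from global (Needleman-Wunsch style) to local (Smith-Waterman style) alignment.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of strings a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]

if __name__ == "__main__":
    print(align_score("ACGT", "ACT"))              # global
    print(align_score("ACGT", "ACT", local=True))  # local
```

Heuristic tools such as BLAST and FASTA avoid filling the full table by first hashing short words; that speed-up is not shown here.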
Bioinformatics Topics -- Sequence / Structure
Secondary Structure “Prediction”
via Propensities
Neural Networks, Genetic Alg.
Simple Statistics
TM-helix finding
Assessing Secondary Structure Prediction
Tertiary Structure Prediction
Fold Recognition
Threading
Ab initio
Function Prediction
Active site identification
Relation of Sequence Similarity to Structural Similarity
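One of the simplest statistical approaches named above, TM-helix finding, can be sketched as a sliding-window average over a hydropathy scale. The example uses the Kyte-Doolittle scale; the window of 19 residues and the cutoff of 1.6 are commonly quoted defaults and are treated here as assumptions, as are the function names.

```python
# Kyte-Doolittle hydropathy values per residue.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Average hydropathy of every window of the given length."""
    seq = seq.upper()
    return [sum(KD[aa] for aa in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to suggest a TM helix."""
    return [i for i, h in enumerate(hydropathy_profile(seq, window)) if h >= threshold]

if __name__ == "__main__":
    toy = "D" * 10 + "L" * 19 + "K" * 10  # made-up sequence with a hydrophobic core
    print(candidate_tm_windows(toy))
```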
Topics -- Structures
Basic Protein Geometry and Least-Squares Fitting
Distances, Angles, Axes, Rotations
Calculating a helix axis in 3D via fitting a line
LSQ fit of 2 structures
Molecular Graphics
Calculation of Volume and Surface
How to represent a plane
How to represent a solid
How to calculate an area
Docking and Drug Design as Surface Matching
Packing Measurement
Structural Alignment
Aligning sequences on the basis of 3D structure.
Unlike with sequences, DP does not converge; what to do?
Other Approaches: Distance Matrices, Hashing
Fold Library
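The basic geometric quantities listed under protein geometry (distances, angles, rotations about a bond) reduce to a few vector operations. The sketch below uses only the standard library; the helper names are illustrative, and the dihedral sign convention (0 degrees for cis, 180 degrees for trans) is one common choice rather than the only one. Least-squares superposition of two structures needs a linear algebra library and is not attempted here.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def distance(a, b):
    """Euclidean distance between two points."""
    return norm(sub(a, b))

def angle(a, b, c):
    """Angle in degrees at point b formed by a-b-c."""
    u, v = sub(a, b), sub(c, b)
    cos_t = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2)))

if __name__ == "__main__":
    print(distance((0, 0, 0), (3.8, 0, 0)))   # a CA-CA-like separation
    print(angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))
```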
Topics -- Databases
Relational Database Concepts
Keys, Foreign Keys
SQL, OODBMS, views, forms, transactions, reports, indexes
Joining Tables, Normalization
Natural Join as "where" selection on cross product
Array Referencing (perl/dbm)
Forms and Reports
Cross-tabulation
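The statement that a natural join is a "where" selection on a cross product can be checked directly. The sketch below uses Python's built-in `sqlite3` module rather than the perl/dbm setup mentioned above; the `protein` and `domain` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE protein (protein_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE domain  (domain_id  INTEGER PRIMARY KEY,
                          protein_id INTEGER REFERENCES protein(protein_id),
                          fold TEXT);
    INSERT INTO protein VALUES (1, 'protA'), (2, 'protB');
    INSERT INTO domain  VALUES (10, 1, 'foldX'), (11, 1, 'foldY'), (12, 2, 'foldX');
""")

# Natural join: rows are matched on the shared column name (protein_id).
natural = conn.execute(
    "SELECT name, fold FROM protein NATURAL JOIN domain ORDER BY domain_id"
).fetchall()

# Same result as a selection on the full cross product of the two tables.
cross_where = conn.execute(
    "SELECT name, fold FROM protein, domain "
    "WHERE protein.protein_id = domain.protein_id ORDER BY domain_id"
).fetchall()

cross_size = conn.execute("SELECT COUNT(*) FROM protein, domain").fetchone()[0]

if __name__ == "__main__":
    print(natural)
    print(cross_size, "rows in the cross product,", len(cross_where), "after selection")
```

Here `protein_id` is the primary key of `protein` and a foreign key in `domain`, which is what makes the join meaningful.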
Protein Units?
What are the units of biological information?
sequence, structure
motifs, modules, domains
How classified: folds, motions, pathways, functions?
Clustering and Trees
Basic clustering
UPGMA
single-linkage
multiple linkage
Other Methods
Parsimony, Maximum likelihood
Evolutionary implications
The Bias Problem
sequence weighting
sampling
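The basic clustering methods named above differ mainly in how the distance between two clusters is defined. A minimal sketch of agglomerative clustering follows: it repeatedly merges the closest pair of clusters, using either the minimum pairwise distance (single linkage) or the mean of all pairwise distances (the rule used by UPGMA). It stops at a requested number of clusters instead of building a full tree, and the function name and toy data are assumptions.

```python
from itertools import combinations

def agglomerate(dist, n_clusters, linkage="single"):
    """Cluster items 0..n-1 given a symmetric distance matrix (list of lists)."""
    clusters = [frozenset([i]) for i in range(len(dist))]

    def cluster_distance(c1, c2):
        ds = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":
            return min(ds)                # closest pair of members
        if linkage == "average":
            return sum(ds) / len(ds)      # mean over all pairs, as in UPGMA
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)

if __name__ == "__main__":
    points = [0.0, 1.0, 5.0, 6.0]  # made-up 1D data
    matrix = [[abs(p - q) for q in points] for p in points]
    print(agglomerate(matrix, 2))
    print(agglomerate(matrix, 2, linkage="average"))
```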
Topics -- Genomics
Expression Analysis
Time Courses clustering
Measuring differences
Identifying Regulatory Regions
Large scale cross referencing of information
Function Classification and Orthologs
The Genomic vs. Single-molecule Perspective
 Genome Comparisons
Ortholog Families, pathways
Large-scale censuses
Frequent Words Analysis
Genome Annotation
Trees from Genomes
Identification of interacting proteins
Structural Genomics
Folds in Genomes, shared & common folds
Bulk Structure Prediction
Genome Trees
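Frequent words analysis, one of the census-style calculations above, amounts to counting all overlapping words of length k. A minimal sketch, with a hypothetical function name and a toy sequence:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

if __name__ == "__main__":
    counts = kmer_counts("ATATATGC", 2)  # made-up sequence
    print(counts.most_common(3))
```

On a real genome the same counts would normally be compared with expectations from base composition before calling a word over- or under-represented.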
Topics -- Simulation
Molecular Simulation
Geometry -> Energy -> Forces
Basic interactions, potential energy functions
Electrostatics
VDW Forces
Bonds as Springs
How structure changes over time?
How to measure the change in a vector (gradient)
Molecular Dynamics & MC
Energy Minimization
Parameter Sets
Number Density
Poisson-Boltzmann Equation
Lattice Models and Simplification
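The chain Geometry -> Energy -> Forces can be illustrated in one dimension. The sketch below treats a bond as a spring, uses a Lennard-Jones form for the VDW term, obtains the force as the negative gradient of the energy by finite differences, and runs a naive steepest-descent minimization. All parameter values are arbitrary illustrative units, not a real parameter set, and the function names are hypothetical.

```python
def bond_energy(r, k=100.0, r0=1.5):
    """Harmonic 'bond as spring' energy."""
    return 0.5 * k * (r - r0) ** 2

def vdw_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 energy."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def numerical_force(energy, r, h=1e-6):
    """Force = -dE/dr, estimated by a central difference."""
    return -(energy(r + h) - energy(r - h)) / (2.0 * h)

def minimize(energy, r, step=0.001, n_steps=1000):
    """Steepest descent: move along the force with a fixed step size."""
    for _ in range(n_steps):
        r += step * numerical_force(energy, r)
    return r

if __name__ == "__main__":
    print(numerical_force(bond_energy, 2.0))  # restoring force pulls r back toward r0
    print(minimize(bond_energy, 2.0))         # relaxes toward r0 = 1.5
```

Molecular dynamics uses the same forces but integrates Newton's equations over time instead of simply moving downhill.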
Bioinformatics Schematic
Background
Are They or Aren’t They Bioinformatics? (#1)
Digital Libraries
Automated Bibliographic Search and Textual Comparison
Knowledge bases for biological literature
Motif Discovery Using Gibbs Sampling
Methods for Structure Determination
Computational Crystallography
Refinement
NMR Structure Determination
Distance Geometry
Metabolic Pathway Simulation
The DNA Computer
Are They or Aren’t They Bioinformatics? (#1, Answers)
(YES?) Digital Libraries
Automated Bibliographic Search and Textual Comparison
Knowledge bases for biological literature
(YES) Motif Discovery Using Gibbs Sampling
(NO?) Methods for Structure Determination
Computational Crystallography
Refinement
NMR Structure Determination
(YES) Distance Geometry
(YES) Metabolic Pathway Simulation
(NO) The DNA Computer
Are They or Aren’t They Bioinformatics? (#2)
Gene identification by sequence inspection
Prediction of splice sites
DNA methods in forensics
Modeling of Populations of Organisms
Ecological Modeling
Genomic Sequencing Methods
Assembling Contigs
Physical and genetic mapping
Linkage Analysis
Linking specific genes to various traits
Are They or Aren’t They Bioinformatics? (#2, Answers)
(YES) Gene identification by sequence inspection
Prediction of splice sites
(YES) DNA methods in forensics
(NO) Modeling of Populations of Organisms
Ecological Modeling
(NO?) Genomic Sequencing Methods
Assembling Contigs
Physical and genetic mapping
(YES) Linkage Analysis
Linking specific genes to various traits
Are They or Aren’t They Bioinformatics? (#3)
RNA structure prediction
Identification in sequences
Radiological Image Processing
Computational Representations for Human Anatomy (visible human)
Artificial Life Simulations
Artificial Immunology / Computer Security
Genetic Algorithms in molecular biology
Homology modeling
Determination of Phylogenies Based on Non-molecular Organism Characteristics
Computerized Diagnosis based on Genetic Analysis (Pedigrees)
Are They or Aren’t They Bioinformatics? (#3, Answers)
(YES) RNA structure prediction
Identification in sequences
(NO) Radiological Image Processing
Computational Representations for Human Anatomy (visible human)
(NO) Artificial Life Simulations
Artificial Immunology / Computer Security
(NO?) Genetic Algorithms in molecular biology
(YES) Homology modeling
(NO) Determination of Phylogenies Based on Non-molecular Organism Characteristics
(NO) Computerized Diagnosis based on Genetic Analysis (Pedigrees)
Major Application I: Designing Drugs
Understanding How Structures Bind Other Molecules (Function)
Designing Inhibitors
Docking, Structure Modeling
(From left to right, figures adapted from the Olson Group Docking Page at Scripps, the Dyson NMR Group web page at Scripps, and the Computational Chemistry Page at Cornell Theory Center.)
Major Application II: Finding Homologues
Find Similar Ones in Different Organisms
Human vs. Mouse  vs. Yeast
Easier to do experiments on the latter!
(Section from NCBI Disease Genes Database Reproduced Below.)
Major Application II: Finding Homologues (cont.)
Cross-Referencing, one thing to another thing
Sequence Comparison and Scoring
Analogous Problems for Structure Comparison
Comparison has two parts:
(1) Optimally Aligning 2 entities to get a Comparison Score
(2) Assessing Significance of this score in a given Context
Integrated Presentation
Align Sequences
Align Structures
Score in a Uniform Framework
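The second part of a comparison, assessing significance, is often expressed through an extreme value distribution of scores. A sketch under stated assumptions follows: it uses the Karlin-Altschul form E = K m n exp(-lambda S) for the expected number of chance hits and P = 1 - exp(-E) for the probability of at least one. The parameters K and lambda depend on the scoring scheme; the values in the usage example are placeholders, not fitted constants.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of chance matches scoring at least `score`.

    m and n are the lengths of the two things compared (e.g. query and database).
    """
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """Probability of at least one chance match scoring at least `score`."""
    return -math.expm1(-e_value(score, m, n, K, lam))

if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    for s in (30, 50, 70):
        print(s, e_value(s, m=300, n=1_000_000, K=0.1, lam=0.3))
```

For small values the P-value and the e-value are nearly equal, which is why the two are often used interchangeably for strong matches.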
Major Application III: Overall Genome Characterization
Overall Occurrence of a Certain Feature in the Genome
e.g. how many kinases in Yeast
Compare Organisms and Tissues
Expression levels in Cancerous vs Normal Tissues
Databases, Statistics
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)
Simplifying Genomes with Folds, Pathways, &c
At What Structural Resolution Are Organisms Different?