|
|
|
Mark Gerstein, Yale University |
|
bioinfo.mbb.yale.edu/mbb452a |
|
|
|
|
|
|
(Molecular) Bio - informatics |
|
One idea for a definition?
Bioinformatics is conceptualizing biology
in terms of molecules (in the sense of physical chemistry) and then
applying “informatics” techniques (derived from disciplines such as applied
math, CS, and statistics) to understand and organize the information
associated with these molecules on a large scale. |
|
Bioinformatics is “MIS” for Molecular Biology
Information. It is a practical discipline with many applications. |
|
|
|
|
|
Central Dogma
of Molecular Biology
DNA
-> RNA
-> Protein
-> Phenotype
->
DNA |
|
Molecules |
|
Sequence, Structure, Function |
|
Processes |
|
Mechanism, Specificity, Regulation |
|
|
|
|
|
Central Paradigm
for Bioinformatics
Genomic
Sequence Information
-> mRNA (level)
-> Protein Sequence
-> Protein Structure
-> Protein Function
-> Phenotype |
|
|
|
Large Amounts of Information |
|
Standardized |
|
Statistical |
|
|
|
|
|
|
|
Raw DNA Sequence |
|
Coding or Not? |
|
Parse into genes? |
|
4 bases: AGCT |
|
~1 K in a gene, ~2 M in genome |
|
|
|
|
|
|
20 letter alphabet |
|
ACDEFGHIKLMNPQRSTVWY but not BJOUXZ |
|
Strings of
~300 aa in an average protein (in bacteria),
~200 aa in a domain |
|
~200 K known protein sequences |
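
The 20-letter alphabet above can be checked mechanically. The following is a minimal sketch; the helper name `is_protein_sequence` and the example strings are illustrative assumptions, not part of the course material.

```python
import string

# The 20 standard one-letter amino-acid codes: A-Z minus B, J, O, U, X, Z.
AMINO_ACIDS = sorted(set(string.ascii_uppercase) - set("BJOUXZ"))


def is_protein_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard codes."""
    return len(seq) > 0 and set(seq.upper()) <= set(AMINO_ACIDS)


print("".join(AMINO_ACIDS))               # ACDEFGHIKLMNPQRSTVWY
print(is_protein_sequence("MKTAYIAKQR"))  # True
print(is_protein_sequence("MKTBX"))       # False (B and X are excluded)
```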
|
|
|
|
|
|
|
|
|
|
DNA/RNA/Protein |
|
Almost all protein |
|
(RNA Adapted From D Soll Web Page,
Right Hand Top Protein from M Levitt web page) |
|
|
|
|
|
|
|
|
Statistics on Number of XYZ triplets |
|
200 residues/domain -> 200 CA atoms,
separated by 3.8 A |
|
Avg. Residue is Leu: 4 backbone atoms + 4
sidechain atoms, 150 cubic A |
|
=>
~1600 xyz triplets (= 8 x 200) per protein domain |
|
~10 K known domains, ~300 folds |
|
|
|
|
|
|
|
The Revolution Driving Everything |
|
Fleischmann, R. D., Adams, M. D., White, O.,
Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J.
F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh,
W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek,
A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom,
E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek,
D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L.,
Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser,
C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random
sequencing and assembly of Haemophilus influenzae Rd." Science 269:
496-512. |
|
(Picture adapted from TIGR website,
http://www.tigr.org) |
|
Integrative Data |
|
1995, HI (bacteria): 1.8 Mb & ~1700 genes
done |
|
1997, yeast: 13 Mb & ~6000 genes |
|
1998, worm: ~100Mb with 19 K genes |
|
1999: >30 completed genomes! |
|
2003, human: 3 Gb & 100 K genes... |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Information to understand genomes |
|
Metabolic Pathways (glycolysis), traditional
biochemistry |
|
Regulatory Networks |
|
Whole Organisms Phylogeny, traditional zoology |
|
Environments, Habitats, ecology |
|
The Literature (MEDLINE) |
|
The Future.... |
|
|
|
(Pathway drawing from P Karp’s EcoCyc, Phylogeny
from S J Gould, Dinosaur in a Haystack) |
|
|
|
|
|
|
|
|
|
|
|
CPU vs Disk & Net |
|
As important as the increase in computer speed
has been, the ability to store large amounts of information on computers is
even more crucial |
|
Driving Force in Bioinformatics |
|
|
|
(Internet picture adapted
from D Brutlag, Stanford) |
|
|
|
|
|
|
|
|
|
|
|
|
Different Sequences Have the Same Structure |
|
Organism has many similar genes |
|
Single Gene May Have Multiple Functions |
|
Genes are grouped into Pathways |
|
Genomic Sequence Redundancy due to the Genetic
Code |
|
How do we find the similarities?..... |
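
The redundancy due to the genetic code can be made concrete with a short sketch. It builds the standard codon table from the usual TCAG ordering and shows two different DNA strings that encode the same peptide. The example sequences and the names `translate` and `codons_for` are made up for illustration.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Group codons by the amino acid they encode to expose the degeneracy.
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(len(codons_for["L"]))    # 6 codons encode leucine
print(translate("ATGCTTAAA"))  # MLK
print(translate("ATGTTGAAG"))  # MLK, from a different DNA sequence
```

Because 61 sense codons map onto 20 amino acids, comparing at the protein level can reveal similarity that a DNA-level comparison would partly hide.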
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(idea from D Brutlag, Stanford) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Databases |
|
Building, Querying |
|
Object DB |
|
Text String Comparison |
|
Text Search |
|
1D Alignment |
|
Significance Statistics |
|
Alta Vista, grep |
|
Finding Patterns |
|
AI / Machine Learning |
|
Clustering |
|
Datamining |
|
Geometry |
|
Robotics |
|
Graphics (Surfaces, Volumes) |
|
Comparison and 3D Matching
(Vision, recognition) |
|
Physical Simulation |
|
Newtonian Mechanics |
|
Electrostatics |
|
Numerical Algorithms |
|
Simulation |
|
|
|
|
|
Because of
the increase in data and the improvement in computers, new calculations become
possible |
|
But Bioinformatics has a new style of
calculation... |
|
Two Paradigms |
|
|
|
Physics |
|
Prediction based on physical principles |
|
Exact Determination of Rocket Trajectory |
|
Supercomputer, CPU |
|
Biology |
|
Classifying information and discovering
unexpected relationships |
|
globin ~ colicin ~ plastocyanin ~ repressor |
|
networks, “federated” database |
|
|
|
|
|
Finding Genes in Genomic DNA |
|
introns |
|
exons |
|
promoters |
|
Characterizing Repeats in Genomic DNA |
|
Statistics |
|
Patterns |
|
Duplications in the Genome |
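
As a first approximation to gene finding, one can scan for open reading frames. The sketch below is deliberately naive: it looks only at the forward strand and ignores introns, promoters, and codon statistics, so it is closer to a bacterial toy case than to a real gene finder. The function name `find_orfs` and the `min_codons` threshold are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end) pairs for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "CCATGAAATTTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])  # 2 14 ATGAAATTTTAG
```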
|
|
|
|
|
Sequence Alignment |
|
non-exact string matching, gaps |
|
How to align two strings optimally via Dynamic
Programming |
|
Local vs Global Alignment |
|
Suboptimal Alignment |
|
Hashing to increase speed (BLAST, FASTA) |
|
Amino acid substitution scoring matrices |
|
Multiple Alignment and Consensus Patterns |
|
How to align more than one sequence and then
fuse the result in a consensus representation |
|
Transitive Comparisons |
|
HMMs, Profiles |
|
Motifs |
|
Scoring schemes and Matching statistics |
|
How to tell if a given alignment or match is
statistically significant |
|
A P-value (or an e-value)? |
|
Score Distributions
(extreme val. dist.) |
|
Low Complexity Sequences |
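
The dynamic-programming idea behind optimal global alignment can be sketched in a few lines. This is a minimal Needleman-Wunsch-style version with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for illustration and stand in for a real substitution matrix.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

A local (Smith-Waterman-style) variant would clamp each cell at zero and start the traceback from the highest-scoring cell rather than the corner.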
|
|
|
|
|
Secondary Structure “Prediction” |
|
via Propensities |
|
Neural Networks, Genetic Alg. |
|
Simple Statistics |
|
TM-helix finding |
|
Assessing Secondary Structure Prediction |
|
Tertiary Structure Prediction |
|
Fold Recognition |
|
Threading |
|
Ab initio |
|
Function Prediction |
|
Active site identification |
|
Relation of Sequence Similarity to Structural
Similarity |
|
|
|
|
|
|
Basic Protein Geometry and Least-Squares Fitting |
|
Distances, Angles, Axes, Rotations |
|
Calculating a helix axis in 3D via fitting a
line |
|
LSQ fit of 2 structures |
|
Molecular Graphics |
|
Calculation of Volume and Surface |
|
How to represent a plane |
|
How to represent a solid |
|
How to calculate an area |
|
Docking and Drug Design as Surface Matching |
|
Packing Measurement |
|
Structural Alignment |
|
Aligning sequences on the basis of 3D structure. |
|
Unlike for sequences, DP does not converge; what to
do? |
|
Other Approaches: Distance Matrices, Hashing |
|
Fold Library |
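
The basic geometric quantities listed above (distances, angles, rotations about a bond) can be computed directly from xyz triplets. The sketch below uses only the standard library; the helper names are illustrative, and the sign of the dihedral angle follows one of several conventions in use, so treat it as an assumption.

```python
import math


def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    return math.sqrt(dot(u, u))


def distance(p, q):
    """Euclidean distance between two points."""
    return norm(sub(p, q))


def angle(p, q, r):
    """Angle p-q-r in degrees, measured at the middle point q."""
    u, v = sub(p, q), sub(r, q)
    return math.degrees(math.acos(dot(u, v) / (norm(u) * norm(v))))


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 axis."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)       # normals of the two planes
    b1_hat = tuple(x / norm(b1) for x in b1)
    m1 = cross(n1, b1_hat)
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


print(distance((0, 0, 0), (3, 4, 0)))          # 5.0
print(angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))  # a right angle
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis arrangement
```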
|
|
|
|
|
|
|
|
Relational Database Concepts |
|
Keys, Foreign Keys |
|
SQL, OODBMS, views, forms, transactions,
reports, indexes |
|
Joining Tables, Normalization |
|
Natural Join as "where" selection on
cross product |
|
Array Referencing (perl/dbm) |
|
Forms and Reports |
|
Cross-tabulation |
|
Protein Units? |
|
What are the units of biological information? |
|
sequence, structure |
|
motifs, modules, domains |
|
How classified: folds, motions, pathways,
functions? |
|
Clustering and Trees |
|
Basic clustering |
|
UPGMA |
|
single-linkage |
|
multiple linkage |
|
Other Methods |
|
Parsimony, Maximum likelihood |
|
Evolutionary implications |
|
The Bias Problem |
|
sequence weighting |
|
sampling |
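
The statement that a natural join is a "where" selection on a cross product can be checked with Python's built-in `sqlite3` module. The two tables and their rows (`protA`, `foldX`, and so on) are hypothetical toy data invented for this sketch; `protein_id` is the key in one table and a foreign key in the other.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE protein (protein_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE domain  (domain_id  INTEGER PRIMARY KEY,
                          protein_id INTEGER REFERENCES protein(protein_id),
                          fold TEXT);
    INSERT INTO protein VALUES (1, 'protA'), (2, 'protB');
    INSERT INTO domain  VALUES (10, 1, 'foldX'), (11, 2, 'foldY'), (12, 2, 'foldX');
""")

# The full cross product pairs every protein row with every domain row.
cross_count = con.execute("SELECT COUNT(*) FROM protein, domain").fetchone()[0]

# Natural join: match rows on the shared column name (protein_id).
natural = con.execute(
    "SELECT name, fold FROM protein NATURAL JOIN domain ORDER BY domain_id"
).fetchall()

# Same result as an explicit WHERE selection on the cross product.
filtered = con.execute(
    "SELECT name, fold FROM protein, domain "
    "WHERE protein.protein_id = domain.protein_id ORDER BY domain_id"
).fetchall()

print(cross_count)          # 6 rows in the cross product (2 x 3)
print(natural)
print(natural == filtered)  # True
```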
|
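
The basic clustering methods named above differ mainly in how the distance between two clusters is defined. The following sketch merges the closest pair of clusters repeatedly, using either the minimum pairwise distance (single linkage) or the mean pairwise distance (the cluster distance used by UPGMA). It returns flat clusters only and does not build a tree; the function name and the toy numbers are assumptions for illustration.

```python
from itertools import combinations


def agglomerate(items, dist, n_clusters, linkage="single"):
    """Merge the closest clusters until only n_clusters remain."""
    clusters = [[x] for x in items]

    def cluster_dist(c1, c2):
        pairwise = [dist(a, b) for a in c1 for b in c2]
        if linkage == "single":
            return min(pairwise)             # nearest members
        return sum(pairwise) / len(pairwise)  # average over all pairs (UPGMA-style)

    while len(clusters) > n_clusters:
        # combinations yields i < j, so deleting index j leaves index i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [1.0, 1.2, 5.0, 5.1, 9.0]
print(agglomerate(points, lambda a, b: abs(a - b), 3))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

For sequences, `dist` could be any pairwise measure, for example one derived from alignment scores.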
|
|
|
|
Expression Analysis |
|
Time Courses clustering |
|
Measuring differences |
|
Identifying Regulatory Regions |
|
Large scale cross referencing of information |
|
Function Classification and Orthologs |
|
The Genomic vs. Single-molecule Perspective |
|
|
|
|
|
Genome
Comparisons |
|
Ortholog Families, pathways |
|
Large-scale censuses |
|
Frequent Words Analysis |
|
Genome Annotation |
|
Trees from Genomes |
|
Identification of interacting proteins |
|
Structural Genomics |
|
Folds in Genomes, shared & common folds |
|
Bulk Structure Prediction |
|
Genome Trees |
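
Frequent words analysis amounts to counting overlapping k-mers. A minimal sketch, with a made-up sequence and an illustrative function name:

```python
from collections import Counter


def frequent_words(seq: str, k: int, top: int = 3):
    """Count all overlapping words of length k and return the most common."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(top)


print(frequent_words("ACGTACGTAC", k=2, top=1))  # [('AC', 3)]
```

On a real genome the same counts are usually compared with what base composition alone would predict, so that over- and under-represented words stand out.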
|
|
|
|
|
|
|
|
Molecular Simulation |
|
Geometry -> Energy -> Forces |
|
Basic interactions, potential energy functions |
|
Electrostatics |
|
VDW Forces |
|
Bonds as Springs |
|
How structure changes over time? |
|
How to measure the change in a vector (gradient) |
|
Molecular Dynamics & MC |
|
Energy Minimization |
|
Parameter Sets |
|
Number Density |
|
Poisson-Boltzmann Equation |
|
Lattice Models and Simplification |
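
The chain Geometry -> Energy -> Forces can be illustrated with the simplest term, a bond treated as a spring. The sketch below computes the harmonic energy E = 0.5 k (d - r0)^2 and the force as the negative gradient of that energy, then checks the analytic force against a finite-difference estimate. The values of `k` and `r0` are arbitrary and do not come from any real parameter set.

```python
import math


def bond_energy(r1, r2, k=100.0, r0=1.5):
    """Harmonic bond energy for atoms at positions r1 and r2."""
    d = math.dist(r1, r2)
    return 0.5 * k * (d - r0) ** 2


def bond_force_on_r1(r1, r2, k=100.0, r0=1.5):
    """Analytic force on atom 1: F = -dE/dr1 = -k (d - r0) (r1 - r2) / d."""
    d = math.dist(r1, r2)
    scale = -k * (d - r0) / d
    return tuple(scale * (a - b) for a, b in zip(r1, r2))


def numerical_force_on_r1(r1, r2, h=1e-6, **params):
    """Central finite-difference estimate of the same force."""
    force = []
    for axis in range(3):
        plus = list(r1); plus[axis] += h
        minus = list(r1); minus[axis] -= h
        dE = bond_energy(plus, r2, **params) - bond_energy(minus, r2, **params)
        force.append(-dE / (2 * h))
    return tuple(force)


r1, r2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
print(bond_energy(r1, r2))            # 12.5 (bond stretched by 0.5)
print(bond_force_on_r1(r1, r2))       # pulls atom 1 toward atom 2 (+x)
print(numerical_force_on_r1(r1, r2))  # should agree closely
```

Energy minimization and molecular dynamics both consume exactly this force: the first steps downhill along it, the second integrates it over time.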
|
|
|
|
|
|
|
|
|
|
Digital Libraries |
|
Automated Bibliographic Search and Textual
Comparison |
|
Knowledge bases for biological literature |
|
Motif Discovery Using Gibbs Sampling |
|
Methods for Structure Determination |
|
Computational Crystallography |
|
Refinement |
|
NMR Structure Determination |
|
Distance Geometry |
|
Metabolic Pathway Simulation |
|
The DNA Computer |
|
|
|
|
|
|
|
|
|
|
(YES?) Digital Libraries |
|
Automated Bibliographic Search and Textual
Comparison |
|
Knowledge bases for biological literature |
|
(YES) Motif Discovery Using Gibbs Sampling |
|
(NO?) Methods for Structure Determination |
|
Computational Crystallography |
|
Refinement |
|
NMR Structure Determination |
|
(YES) Distance Geometry |
|
(YES) Metabolic Pathway Simulation |
|
(NO) The DNA Computer |
|
|
|
|
|
|
|
|
|
Gene identification by sequence inspection |
|
Prediction of splice sites |
|
DNA methods in forensics |
|
Modeling of Populations of Organisms |
|
Ecological Modeling |
|
Genomic Sequencing Methods |
|
Assembling Contigs |
|
Physical and genetic mapping |
|
Linkage Analysis |
|
Linking specific genes to various traits |
|
|
|
|
|
(YES) Gene identification by sequence inspection |
|
Prediction of splice sites |
|
(YES) DNA methods in forensics |
|
(NO) Modeling of Populations of Organisms |
|
Ecological Modeling |
|
(NO?) Genomic Sequencing Methods |
|
Assembling Contigs |
|
Physical and genetic mapping |
|
(YES) Linkage Analysis |
|
Linking specific genes to various traits |
|
|
|
|
|
RNA structure prediction
Identification in sequences |
|
Radiological Image Processing |
|
Computational Representations for Human Anatomy
(visible human) |
|
Artificial Life Simulations |
|
Artificial Immunology / Computer Security |
|
Genetic Algorithms in molecular biology |
|
Homology modeling |
|
Determination of Phylogenies Based on
Non-molecular Organism Characteristics |
|
Computerized Diagnosis based on Genetic Analysis
(Pedigrees) |
|
|
|
|
|
(YES) RNA structure prediction
Identification in sequences |
|
(NO) Radiological Image Processing |
|
Computational Representations for Human Anatomy
(visible human) |
|
(NO) Artificial Life Simulations |
|
Artificial Immunology / Computer Security |
|
(NO?) Genetic Algorithms in molecular biology |
|
(YES) Homology modeling |
|
(NO) Determination of Phylogenies Based on
Non-molecular Organism Characteristics |
|
(NO) Computerized Diagnosis based on Genetic
Analysis (Pedigrees) |
|
|
|
|
|
|
|
|
|
Understanding How Structures Bind Other
Molecules (Function) |
|
Designing Inhibitors |
|
Docking, Structure Modeling |
|
|
|
(From left to right, figures adapted from
Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps,
and from Computational Chemistry Page at Cornell Theory Center). |
|
|
|
|
|
|
|
Find Similar Ones in Different Organisms |
|
Human vs. Mouse
vs. Yeast |
|
Easier to do Expts. on latter! |
|
(Section from NCBI Disease Genes Database
Reproduced Below.) |
|
|
|
|
|
Cross-Referencing, one thing to another thing |
|
Sequence Comparison and Scoring |
|
Analogous Problems for Structure Comparison |
|
Comparison has two parts: |
|
(1) Optimally Aligning 2 entities to get a
Comparison Score |
|
(2) Assessing Significance of this score in a
given Context |
|
|
|
Integrated Presentation |
|
Align Sequences |
|
Align Structures |
|
Score in a Uniform Framework |
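
The two parts of a comparison, scoring and then assessing significance, can be separated cleanly in code. In this sketch the score is a deliberately simple ungapped identity count, and significance is estimated by shuffling one sequence to build an empirical null distribution. This shuffle-based z-score and p-value are a stand-in for illustration, not the extreme-value statistics used by real search tools; all names and the example sequence are assumptions.

```python
import random


def identity_score(a: str, b: str) -> int:
    """Part 1: a comparison score (here, identical positions without gaps)."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_significance(a, b, score=identity_score, n_shuffles=1000, seed=0):
    """Part 2: compare the observed score with scores against shuffled b."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = score(a, b)
    letters = list(b)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null.append(score(a, "".join(letters)))
    mean = sum(null) / n_shuffles
    var = sum((s - mean) ** 2 for s in null) / n_shuffles
    z = (observed - mean) / var ** 0.5 if var > 0 else float("inf")
    # Empirical p-value with the usual +1 correction.
    p = (1 + sum(s >= observed for s in null)) / (n_shuffles + 1)
    return observed, z, p


seq = "ACDEFGHIKLMNPQRSTVWY"
observed, z, p = shuffle_significance(seq, seq)
print(observed, round(z, 1), p)
```

Any scoring function with the same signature, for sequences or for structures, can be passed as `score`, which is the sense in which the two kinds of comparison fit one framework.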
|
|
|
|
|
|
|
|
|
|
|
|
|
Overall Occurrence of a Certain Feature in the
Genome |
|
e.g. how many kinases in Yeast |
|
Compare Organisms and Tissues |
|
Expression levels in Cancerous vs Normal Tissues |
|
Databases, Statistics |
|
|
|
(Clock figures, yeast v. Synechocystis,
adapted from GeneQuiz Web Page, Sander Group, EBI) |
|
|
|
|
|