Molecular Biophysics & Biochemistry 447b3 / 747b3Bioinformatics
Relational Databases
- Databases make program data persistent
- RDB’s turn formless data in a number of structured tables
- Ways of joining together tables to give various views of the data
UnstructuredData
This type of “membership” analysis has been performed previously in terms of the occurrence of sequence motifs, families, functions, and biochemical pathways. Starting from the most basic units, genomes have been compared in terms of the relative frequencies of short oligonucleotide and oligopeptide “words” (Blaisdell et al., 1996; Karlin & Burge, 1995; Karlin et al., 1992; Karlin et al., 1996). The degree of gene duplication in a number of genomes has been ascertained (Brenner et al., 1995; Koonin et al., 1996b; Riley & Labedan, 1997; Wolfe & Shields, 1997; Gerstein, 1997; Tamames et al., 1997). Other analyses have looked at how many highly conserved sequence families in one organism are present in another (Green et al., 1993; Koonin et al., 1995; Tatusov et al., 1997; Ouzounis et al., 1995a,b; Clayton et al., 1997). Finally, if sequences can be related to specific functions and pathways, one can see whether homologous sequences in two organisms truly have the same role (ortholog vs. paralog) and whether particular pathways are present or absent in different organisms (Karp et al., 1996a; Karp et al., 1996b; Koonin et al., 1996a; Mushegian & Koonin, 1996; Tatusov et al., 1996, 1997). This work has yielded many interesting conclusions in terms of pathways that are modified or absent in certain organisms. For instance, the essential citric acid cycle is found to be highly modified in H. influenzae (Fleischmann et al., 1995; Tatusov et al., 1996). Furthermore, identifying pathways and proteins unique to certain microbes may prove useful for developing drugs (e.g. antibiotics against bacteria, Tatusov et al., 1997). In some genome annotation systems, attempts have been made to integrate a variety of membership analyses and perform them on a large scale in a highly automated fashion (Bork et al., 1992a; Bork et al., 1992b; Scharf et al., 1994; Casari et al., 1995; Ouzounis et al., 1995a; Gaasterland & Sensen, 1996).
Semi-Structured Data
REMARK 8 HET GROUP TRIVIAL NAME: FLAVIN ADENINE DINUCLEOTIDE (FAD) 1FNB 79
REMARK 8 CAS REGISTRY NUMBER: 146-14-5 1FNB 80
REMARK 8 SEQUENCE NUMBER: 315 1FNB 81
REMARK 8 NUMBER OF ATOMS IN GROUP: 53 1FNB 82
REMARK 8 HET GROUP TRIVIAL NAME: PHOSPHATE 1FNB 84
REMARK 8 SEQUENCE NUMBER: 316 1FNB 85
REMARK 8 NUMBER OF ATOMS IN GROUP: 5 1FNB 86
REMARK 8 HET GROUP TRIVIAL NAME: SULFATE 1FNB 88
REMARK 8 SEQUENCE NUMBER: 317 1FNB 89
REMARK 8 NUMBER OF ATOMS IN GROUP: 5 1FNB 90
REMARK 8 HET GROUP TRIVIAL NAME: K2 PT(CN)4 1FNB 92
REMARK 8 CHARGE: 2- ( PT(CN)4 -- ) 1FNB 93
REMARK 8 SEQUENCE NUMBER: PT1 - PT7 1FNB 94
REMARK 8 NUMBER OF ATOMS IN GROUP: 9 1FNB 95
REMARK 8 ADDITIONAL COMMENTS: BINDING SITES USED IN MIR PHASING 1FNB 96
REMARK 8 HEAVY ATOM PARAMETERS ARE AS FOLLOWS: 1FNB 98
REMARK 8 PT PT 1 11.832 -8.309 27.027 0.68 33.00 1FNB 99
REMARK 8 PT PT 2 13.996 -2.135 13.212 0.42 40.00 1FNB 100
REMARK 8 PT PT 3 33.293 18.752 27.229 0.32 42.00 1FNB 101
REMARK 8 PT PT 4 19.961 -15.348 -10.328 0.23 28.00 1FNB 102
REMARK 8 PT PT 5 8.312 14.713 35.679 0.26 31.00 1FNB 103
REMARK 8 PT PT 6 27.594 -7.790 23.540 0.14 35.00 1FNB 104
REMARK 8 PT PT 7 15.917 -9.001 12.608 0.30 50.00 1FNB 105
REMARK 8 HET GROUP TRIVIAL NAME: URANYL NITRATE (UO2--) 1FNB 107
REMARK 8 EMPIRICAL FORMULA: UO2 (NO3)2 1FNB 108
REMARK 8 CHARGE: 2- 1FNB 109
REMARK 8 SEQUENCE NUMBER: UR1 - UR13 1FNB 110
REMARK 8 NUMBER OF ATOMS IN GROUP: 3 1FNB 111
REMARK 8 ADDITIONAL COMMENTS: BINDING SITES USED IN MIR PHASING 1FNB 112
REMARK 8 HEAVY ATOM PARAMETERS ARE AS FOLLOWS: 1FNB 114
REMARK 8 U UR 1 8.513 16.214 36.081 0.49 27.00 1FNB 115
Structured Data
fid_ bestrep N_minsp N_scop objname
1.001.001 d1flp__ 8 340 Globin-like
1.001.002 d1hdj__ 4 33 Long alpha-hairpin
1.001.003 d1ctj__ 9 78 Cytochrome c
1.001.004 d1enh__ 18 76 DNA-binding 3-helical bundle
1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz
1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain
1.001.007 d2spca_ 1 2 Spectrin repeat unit
1.001.008 d1bdd__ 1 4 Immunoglobulin-binding protein A modules
1.001.009 d1bal__ 1 5 Peripheral subunit-binding domain of 2-ox
1.001.010 d2erl__ 3 5 Protozoan pheromone proteins
gid_ TrgStrt TrgStop did HI0299 119 135 d193l__ HI0572 180 240 d1aba__ HI0989 56 125 d1aco_1 HI0988 106 458 d1aco_2 HI0154 2 76 d1acp__ HI1633 2 432 d1adea_ HI0349 1 183 d1aky__ HI1309 35 52 d1alo_3 HI0589 8 25 d1alo_3 HI1358 239 444 d1amg_2 HI1358 218 410 d1amy_2 HI0460 20 24 d1ans__ HI1386 139 147 d1ans__ HI0421 11 14 d1ans__ HI0361 285 295 d1ans__ HI0835 100 106 d1ans__
SQL
- SIMPLE Language for Building and Querying Tables
- SELECT various entries from it (tuples, rows)
- Example: How Many Globin Foldsare there in E. coli versus Yeast?
matches table
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
# Match in Gene TrgStop int,
matches table 2
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
(HI0299, 119, 135, d193l__, 3.1)
structures table
folds table
# number of helices & sheets
fid_ bestrep N_hlx N_beta name
1.001.001 d1flp__ 8 0 Globin-like
1.001.002 d1hdj__ 4 0 Long alpha-hairpin
1.001.003 d1ctj__ 9 0 Cytochrome c
1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle
1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz
1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain
1.001.007 d2spca_ 0 2 Spectrin repeat unit
1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules
1.001.009 d1bal__ 0 5 Peripheral subunit-binding domain of 2-ox
1.001.010 d2erl__ 3 5 Protozoan pheromone proteins
Table Interpretation
Match Table: Ways Structures A, B, and C can match HI Genome
Structures have a limited number of folds, which have various characteristics
Structure of a Table
- Column
- Field
- Attribute of an Entity
- dimension
- Key
- Certain Attributes (or combination of attributes) can uniquely identify an object, these are keys
What is a Key?
table matches(gid, TrgStrt, TrgStop, did, score)
table structures(did, fid)
table folds(fid, bestrep, N_hlx, N_beta, name)
gid,TrgStrt -> unique match (one tuple)
thus, primary key gid,TrgStrt
gid,TrgStop -> unique match as well
fid -> many did’s, but did -> one fid
one-to-one between fid and name
SQL Select on a Single Table
- Select {columns} from {a table} where {row-selection is true}
- projection of a selection
- Sort result on a attribute
SQL Select on a Single Table, Example
- Select * from matches where gid= HI0016
HI0016 1 173 d1dar_2 2e-07
HI0016 179 274 d1dar_1 8.5e-06
HI0016 399 476 d1dar_4 0.00031
- Select * from matches where gid= HI0016 and TrgStrt=179
HI0016 179 274 d1dar_1 8.5e-06
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI0016 1 173 d1dar_2 2e-07
HI0016 179 274 d1dar_1 8.5e-06
HI0016 399 476 d1dar_4 0.00031
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
SQL Select on a Single Table, Example 2
- Select did from matches where score < 0.0001
d1aky__, d1dar_2, d1dar_1
HI0349 1 183 d1aky__ 7.6e-36
I0016 1 173 d1dar_2 2e-07
HI0016 179 274 d1dar_1 8.5e-06
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI0016 1 173 d1dar_2 2e-07
HI0016 179 274 d1dar_1 8.5e-06
HI0016 399 476 d1dar_4 0.00031
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
Joins
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
fid_ bestrep N_hlx N_beta name
1.001.001 d1flp__ 8 0 Globin-like
1.001.002 d1hdj__ 4 0 Long alpha-hairpin
1.001.003 d1ctj__ 9 0 Cytochrome c
1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle
1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz
1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain
1.001.007 d2spca_ 0 2 Spectrin repeat unit
1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules
1.007.008 d1qkt__ 4 3 Neurotoxin III (ATX III)
1.001.010 d2erl__ 3 5 Protozoan pheromone proteins
SQL Select on Multiple Tables
- Select * from matches, structures, foldswherematches.gid = HI0361and matches.did=structures.didand structures.fid = folds.fid
- Returnsmatches | structures | folds HI0361,285,295,d1ans__ ,8.2 | d1ans__,1.007.008 | 1.007.008,d1qkt__,4, 3,Neurotoxin III ...
- Select score,name from matches, structures, folds where gid = HI0361and matches.did=structures.didand structures.fid = folds.fid 8.2, Neurotoxin III ...
Foreign Key
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
matches.did is a (foreign) key in the structures table -- i.e. looks up exactly one structure.
Selection as Array Lookup
- Same for a fold identifier from a structure id
- $fid=$structure{$did}
- (perl pseudo-code)
- Same for matches and folds tables, but this time arrays return multiple values and have multiple field keys
- ($bestrep, $N_hlx, $N_beta, $name) = $folds{$fid}
- ($TrgStop,$did,$score)=$match{$gid,$TrgStrt}
- Joining as a double-lookup
- $did = 1mbd__($bestrep, $N_hlx, $N_beta, $name) = $folds{ $structures{$did} }
- Select bestrep,N_hlx,N_beta,name from structures, folds where structures.fid = folds.fid and structures.did = 1mbd__
SQL Select on Multiple Tables
- Select {columns} from {huge cross-product of tables} where {row-selection is true}
- cross-product T(1) x T(2) builds a huge virtual table where every row of T(1) is paired with every row of T(2). Then perform selection on this.
- Select fid from matches,structures where gid=HI009 and matches.did = structures.did
Cross Product A x B
A(1) = Row 1 of Table AA(2) = Row 2 of Table AA(i) = Row i of Table A
A has N rowsand C columns
B(1) = Row 1 of Table BB(2) = Row 2 of Table BB(i) = Row i of Table B
B has M rowsand K columns
A x B hasN x M rows and C+K columns
A(1)B(1)A(1)B(2)A(1)B(3)...A(1)B(M)A(2)B(1)A(2)B(2)A(2)B(3)...A(2)B(M)A(N)B(1)A(N)B(2)A(N)B(3)...A(N)B(M)
ER-diagrams
- Korth & Silberschatz
- branch <=> matches (gid-start +++ did)
- customer <=> folds (fid +++)
- linked by account <=> structures (did fid)
Aggregate Functions--Statistics on Attributes
- Query Statistics
- select gid, count (distinct did) from matches
- select max(N_hlx) from folds where N_beta = 0
- How many matches to globins in the E. coli genome
- Complex Query by nesting selections
- F <= select fid from folds where name contains “globin”
- D <= select did from structures where fid in F
- N <= select count(distinct gid,TrgStrt) from matcheswhere did in D and score < .01
Joins
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
fid_ bestrep N_hlx N_beta name
1.001.001 d1flp__ 8 0 Globin-like
1.001.002 d1hdj__ 4 0 Long alpha-hairpin
1.001.003 d1ctj__ 9 0 Cytochrome c
1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle
1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz
1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain
1.001.007 d2spca_ 0 2 Spectrin repeat unit
1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules
1.007.008 d1qkt__ 4 3 Neurotoxin III (ATX III)
1.001.010 d2erl__ 3 5 Protozoan pheromone proteins
Join Gives Unnormalized Table
gid_ TrgStrt TrgStop did score fid N_hlx N_beta name
HI0299 119 135 d193l__ 3.1 1.010.002 0 2 Spectrin repeat unit
HI0572 180 240 d1aba__ 0.0032 1.002.045 1 2 Mu transposase, DNA-binding domain
HI0989 56 125 d1aco_1 0.0049 1.001.031 8 0 Globin-like
HI0988 106 458 d1aco_2 4.4e-14 1.001.031 8 0 Globin-like
HI0154 2 76 d1acp__ 1.2e-23 1.001.031 8 0 Globin-like
HI1633 2 432 d1adea_ 0 1.010.002 0 2 Spectrin repeat unit
HI0349 1 183 d1aky__ 7.6e-36 1.001.031 8 0 Globin-like
HI1309 35 52 d1alo_3 1.1 1.007.008 4 3 Neurotoxin III (ATX III)
HI0589 8 25 d1alo_3 1.8 1.002.045 1 2 Mu transposase, DNA-binding domain
HI1358 239 444 d1amg_2 0.002 1.004.002 1 3 Diphtheria toxin repressor (DtxR)
HI1358 218 410 d1amy_2 0.00037 1.002.044 0 4 Immunoglobulin-binding protein A
HI0460 20 24 d1ans__ 1.8 1.007.008 4 3 Neurotoxin III (ATX III)
HI1386 139 147 d1ans__ 3.3 1.007.008 4 3 Neurotoxin III (ATX III)
HI0421 11 14 d1ans__ 6.4 1.007.008 4 3 Neurotoxin III (ATX III)
HI0361 285 295 d1ans__ 8.2 1.007.008 4 3 Neurotoxin III (ATX III)
HI0835 100 106 d1ans__ 9.7 1.007.008 4 3 Neurotoxin III (ATX III)
Joining Two or More Tables with a Select Query Gives a New, “Bigger” Table
Normalization
gid_ TrgStrt TrgStop did score fid N_hlx N_beta name
HI0299 119 135 d193l__ 3.1 1.010.002 0 2 Spectrin repeat unit
HI0572 180 240 d1aba__ 0.0032 1.002.045 1 2 Mu transposase, DNA-binding domain
HI0989 56 125 d1aco_1 0.0049 1.001.031 8 0 Globin-like
HI0988 106 458 d1aco_2 4.4e-14 1.001.031 8 0 Globin-like
HI0154 2 76 d1acp__ 1.2e-23 1.001.031 8 0 Globin-like
HI1633 2 432 d1adea_ 0 1.010.002 0 2 Spectrin repeat unit
HI0349 1 183 d1aky__ 7.6e-36 1.001.031 8 0 Globin-like
HI1309 35 52 d1alo_3 1.1 1.007.008 4 3 Neurotoxin III (ATX III)
HI0589 8 25 d1alo_3 1.8 1.002.045 1 2 Mu transposase, DNA-binding domain
HI1358 239 444 d1amg_2 0.002 1.004.002 1 3 Diphtheria toxin repressor (DtxR)
HI1358 218 410 d1amy_2 0.00037 1.002.044 0 4 Immunoglobulin-binding protein A
HI0460 20 24 d1ans__ 1.8 1.007.008 4 3 Neurotoxin III (ATX III)
HI1386 139 147 d1ans__ 3.3 1.007.008 4 3 Neurotoxin III (ATX III)
HI0421 11 14 d1ans__ 6.4 1.007.008 4 3 Neurotoxin III (ATX III)
HI0361 285 295 d1ans__ 8.2 1.007.008 4 3 Neurotoxin III (ATX III)
HI0835 100 106 d1ans__ 9.7 1.007.008 4 3 Neurotoxin III (ATX III)
- What if Want to update Fold 1.007.008 to be “Neurotoxin IV”?
- So Good if Previously Normalized into Separate Tables
- Eliminate Redundancy
- Allow Consistent Updating
Normalization Example
Name City Area-Code Phone-Number
Normalized Tables
gid_ TrgStrt TrgStop did score
HI0299 119 135 d193l__ 3.1
HI0572 180 240 d1aba__ 0.0032
HI0989 56 125 d1aco_1 0.0049
HI0988 106 458 d1aco_2 4.4e-14
HI0154 2 76 d1acp__ 1.2e-23
HI0349 1 183 d1aky__ 7.6e-36
HI1358 239 444 d1amg_2 0.002
HI1358 218 410 d1amy_2 0.00037
HI1386 139 147 d1ans__ 3.3
HI0361 285 295 d1ans__ 8.2
HI0835 100 106 d1ans__ 9.7
fid_ bestrep N_hlx N_beta name
1.001.001 d1flp__ 8 0 Globin-like
1.001.002 d1hdj__ 4 0 Long alpha-hairpin
1.001.003 d1ctj__ 9 0 Cytochrome c
1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle
1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz
1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain
1.001.007 d2spca_ 0 2 Spectrin repeat unit
1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules
1.007.008 d1qkt__ 4 3 Neurotoxin III (ATX III)
1.001.010 d2erl__ 3 5 Protozoan pheromone proteins
Query Optimization
- Get at the Data Quickly!!
- Hash Function Reproduce the Effect of Indexes
- Rapidly Associate a Bucket with Each Key
- Joining 10 tables, which to do first?
- Joining is slow so store some tables in unnormalized form
Indexes Speed Access
Object Databases
Forms & reports [user views]
- Reports are the result of running a succession of selects queries on a database, joining together a number of tables, and then pasting the results together
- Forms are the same but they are editable
- Forms and Reports represent particular views of the data
- For instance, one can be keyed on gene id listing all the structures matching a gene and the other could be keyed on structure id listing all the gene matching a given structure
Example Report: Motions Database
Example Report: Motions Database
CREATE TABLE classifications (
CREATE TABLE descriptions (
CREATE TABLE single_vals (
CREATE TABLE structures (
CREATE TABLE value_names (
CREATE TABLE endnote_refs (
Report shows information, merging together many tables with variable amounts of information. Form same but allows entry.
Example Report: Motions Database
Structures: Variable Number Per ID (Var. Num. of Phone Num. per Person), Foreign Key into PDB
Example Report: Motions Database
Joining Two Tables and Iterating in Perl
$sth = $dbh->query("SELECT value_names.name,
single_vals.val,single_vals.comment ".
"FROM value_names,single_vals ".
"WHERE single_vals.id_ = '$id' AND
single_vals.name_ = value_names.abbrev_ ".
"ORDER BY value_names.name");
&PrintHead("Particular values describing motion");
for ($i=0; $i<$rows; $i++) {
@values = $sth->fetchrow;
PrintSingleVals(@values);
Example Report: Motions Database
aat 7 Aspartate Amino Transferase (AAT)
acetyl 1005 Acetylcholinesterase
br 97 Bacteriorhodopsin (bR)
313 S Subramaniam, M Gerstein, D Oesterhelt and R H Hender
893 R Henderson, J M Baldwin, T A Ceska, F Zemlin, E Beckm
1007 M K Gilson, T P Straatsma, JA A McCammon, D R Ripoll,
647 W E Meador, A R Means and F A Quiocho (1992). Target e
648 M Ikura, G M Clore, A M Gronenborn, G Zhu, C B Klee an
649 B-H Oh, J Pandit, C-H Kang, K Nikaido, S Gokcen, G F-L
Join Two Lists (Protein Names
and References) with a Table Containing Key for each List (a Relation: protein has reference.)
SELECT endnote_refs.name, refs.medline_I FROM endnote_refs,refs WHERE refs.id_ = ’cm' AND refs.endnote_I = endnote_refs.num_I
Example Report: Motions Database
How to Store Complex Data? (File Pointers, BLOBS, OODB)
Packing Based Classification
Packing Based Classification:Hinge v Shear
Shear Mechanism Involves Many Small Motions across a Continuously Maintained Interface
Hinge Mechanism involves absence of steric constraints, esp. at hinge
Simple Forms Example: Links Page
http://bioinfo.mbb.yale.edu/ius/?MIval=links&page=course
Aspects of Forms:Transactions and Security
- Transactions
- Genome Centers and United Airlines!
- Log each entry and enable UNDO
- Security
- Only certain users can modify certain fields
Large-scale Example: Census DB
Molecular Biology Information:Whole Genomes
- The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
- Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998: 14 completed genomes!
1998, worm: 75 of 100 Mb done with 13 K genes so far
2003, human: 3 Gb & 100 K genes...
-
Explonential Growth of Data Matched by Development of Computer Technology
- CPU vs Disk & Net
- As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
- Driving Force in Bioinformatics
(Internet picture adaptedfrom D Brutlag, Stanford)
Num.Protein DomainStructures
Major Application II:Overall Genome Characterization
- Overall Occurrence of a Certain Feature in the Genome
- e.g. how many kinases in Yeast
- Compare Organisms and Tissues
- Expression levels in Cancerous vs Normal Tissues
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)
Shared Folds in Genomes are a/b
- Compare Genomes against Fold Library
- HI (bacteria, ~1700 ORFs)
- MJ (archeon, ~1700)
- SC (eukaryote, ~6200)
- EC, MP, MG, HP, SS as well.135 of ~300 known folds in all 3 (106 HI vs. 54 MJ, PDB bias)
- 45 shared by all, and 38 of these mixed-ab
What are the Most Common Folds in the Genomes?How many of these are shared?
5 Folds in Top-10 All 3 Genomes
All share a/b structure with repeated R.H. bab units connecting adjacent strands or nearly so (18+4+2 of 24)
TM-helix “prediction”
- TM prediction (KD, GES). Count number with 2 peaks, 3 peaks, &c.
- Yeast has more mem. prots., esp. 2-TMs
- No preference for particular supersecondary structures: 7-TM’s
2º Structure Prediction
Each a.a. has different propensity for local structure ? Different Compositions ? Different Local Structure
However, bulk prediction of 2º struc. gives same fraction of a and b in the genomes (by element, half a, half b)
Supersecondary structure words
- Look at super-secondary patterns (“words” such as aa or bab) in predictions
- Compare observed freq. with expected freq. odds = f(ab)/f(a)f(b)(Freq. Words, Karlin)
- Find
HI more aa, aaa, aaaa ...
SC more bb, bbb, bbbbb...
MJ more abab, baba …
Complex Data Example:Encoding Trees in RDBs
Node Name1 Organism2 Bacteria3 Archea4 Eukarya5 Metazoa6 Plants
Node Parent 1 02 13 14 15 36 3
RDBs Everywhere: Internet Mail
RDBs Everywhere: File System
INODE SIZE PERMISSION USER GROUP BYTES MMM-DD--YEAR NAME
120462 1 drwxr-xr-x 10 mbg gerstein 1024 Feb 12 1997 .
120463 1 drwxr-xr-x 2 mbg gerstein 1024 Jan 30 1997 ./hi-tbl
120464 514 -rw-r--r-- 1 mbg gerstein 525335 Nov 10 1996 ./hi-tbl/id_gorss.tbl
120465 19 -rw-r--r-- 1 mbg gerstein 18469 Nov 10 1996 ./hi-tbl/id_kytedool.tbl
120466 514 -rw-r--r-- 1 mbg gerstein 525372 Nov 10 1996 ./hi-tbl/id_seq.tbl
108224 507 -rw-r--r-- 1 mbg gerstein 518822 Nov 10 1996 ./mj-tbl/id_gorss.tbl
108227 54 -rw-r--r-- 1 mbg gerstein 54775 Jan 30 1997 ./mj-tbl/id_abcode.tbl
108228 19 -rw-r--r-- 1 mbg gerstein 19131 Nov 11 1996 ./mj-tbl/id_kytedool.tbl
108229 106 -rw-r--r-- 1 mbg gerstein 108345 Nov 16 1996 ./mj-tbl/word_stats.tbl.bak
108230 106 -rw-r--r-- 1 mbg gerstein 108354 Jan 28 1997 ./mj-tbl/word_stats.tbl
108231 7 -rw-r--r-- 1 mbg gerstein 6962 Jan 30 1997 ./mj-tbl/hist_seqlen.tbl
108232 7 -rw-r--r-- 1 mbg gerstein 6967 Jan 30 1997 ./mj-tbl/hist_num_H_res.tbl
91903 1 drwxr-xr-x 2 mbg gerstein 1024 Nov 19 1996 ./po-tbl
USER:PASSWD:UID:GID:COMMENT:DIR:SHELL
ftp:*:14:50:FTP User:/home/ftp:
mlml:cw5ZrAmNBAxvU:106:100:Michael Levitt (linux):/u1/mlml:/bin/tcsh
dabushne:ErR3hu4q0tO7Y:108:100:Dave:/u1/dabushne:/bin/tcsh
mbg:V9CPWXAG.mo3E:5514:165:Mark Gerstein,432A, BASS,2-6105,:/u0/mbg:/bin/tcsh
mbgmbg:V9CPWXAG.mo3E:5515:165:logs into mbg,,,,:/u0/mbg:/bin/tcsh
mbg10:V9CPWXAG.mo3E:5516:165:alternate account for mbg:/home/mbg10:/bin/tcsh
local::502:20:Local Installed Packages:/u1/local:/bin/tcsh
login::503:20:Hyper Login:/u0/login:/u0/login/hyper-login.pl