Molecular Biophysics & Biochemistry 447b3 / 747b3

Bioinformatics

Mark Gerstein

Class 8, 2/4/98

Yale University

Relational Databases

Databases make program data persistent

RDBs turn formless data into a number of structured tables
- Ways of joining together tables to give various views of the data

Unstructured Data

Semi-Structured Data

REMARK 8 CAS REGISTRY NUMBER: 146-14-5 1FNB 80

REMARK 8 SEQUENCE NUMBER: 315 1FNB 81

REMARK 8 NUMBER OF ATOMS IN GROUP: 53 1FNB 82

REMARK 8 1FNB 83

REMARK 8 HET GROUP TRIVIAL NAME: PHOSPHATE 1FNB 84

REMARK 8 SEQUENCE NUMBER: 316 1FNB 85

REMARK 8 NUMBER OF ATOMS IN GROUP: 5 1FNB 86

REMARK 8 1FNB 87

REMARK 8 HET GROUP TRIVIAL NAME: SULFATE 1FNB 88

REMARK 8 SEQUENCE NUMBER: 317 1FNB 89

REMARK 8 NUMBER OF ATOMS IN GROUP: 5 1FNB 90

REMARK 8 1FNB 91

REMARK 8 HET GROUP TRIVIAL NAME: K2 PT(CN)4 1FNB 92

REMARK 8 CHARGE: 2- ( PT(CN)4 -- ) 1FNB 93

REMARK 8 SEQUENCE NUMBER: PT1 - PT7 1FNB 94

REMARK 8 NUMBER OF ATOMS IN GROUP: 9 1FNB 95

REMARK 8 ADDITIONAL COMMENTS: BINDING SITES USED IN MIR PHASING 1FNB 96

REMARK 8 1FNB 97

REMARK 8 HEAVY ATOM PARAMETERS ARE AS FOLLOWS: 1FNB 98

REMARK 8 PT PT 1 11.832 -8.309 27.027 0.68 33.00 1FNB 99

REMARK 8 PT PT 2 13.996 -2.135 13.212 0.42 40.00 1FNB 100

REMARK 8 PT PT 3 33.293 18.752 27.229 0.32 42.00 1FNB 101

REMARK 8 PT PT 4 19.961 -15.348 -10.328 0.23 28.00 1FNB 102

REMARK 8 PT PT 5 8.312 14.713 35.679 0.26 31.00 1FNB 103

REMARK 8 PT PT 6 27.594 -7.790 23.540 0.14 35.00 1FNB 104

REMARK 8 PT PT 7 15.917 -9.001 12.608 0.30 50.00 1FNB 105

REMARK 8 1FNB 106

REMARK 8 HET GROUP TRIVIAL NAME: URANYL NITRATE (UO2--) 1FNB 107

REMARK 8 EMPIRICAL FORMULA: UO2 (NO3)2 1FNB 108

REMARK 8 CHARGE: 2- 1FNB 109

REMARK 8 SEQUENCE NUMBER: UR1 - UR13 1FNB 110

REMARK 8 NUMBER OF ATOMS IN GROUP: 3 1FNB 111

REMARK 8 ADDITIONAL COMMENTS: BINDING SITES USED IN MIR PHASING 1FNB 112

REMARK 8 1FNB 113

REMARK 8 HEAVY ATOM PARAMETERS ARE AS FOLLOWS: 1FNB 114

REMARK 8 U UR 1 8.513 16.214 36.081 0.49 27.00 1FNB 115

Structured Data

d2rs51_ 1.002.007

d1imr__ 1.010.002

d1pyib1 1.007.030

d1dxtd_ 1.001.001

d181l__ 1.004.002

d1vmoa_ 1.002.044

d2gsq_1 1.001.031

d1etb2_ 1.002.003

d1guha1 1.001.031

d1hrc__ 1.001.003

d150lc_ 1.004.002

d1dmf__ 1.007.035

d1l19__ 1.004.002

d1yrnc_ 1.010.002

d1apld_ 1.001.004

d1ndab2 1.003.004

d2rmai_ 1.002.036

1.001.001 d1flp__ 8 340 Globin-like

1.001.002 d1hdj__ 4 33 Long alpha-hairpin

1.001.003 d1ctj__ 9 78 Cytochrome c

1.001.004 d1enh__ 18 76 DNA-binding 3-helical bundle

1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz

1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain

1.001.007 d2spca_ 1 2 Spectrin repeat unit

1.001.008 d1bdd__ 1 4 Immunoglobulin-binding protein A modules

1.001.009 d1bal__ 1 5 Peripheral subunit-binding domain of 2-ox

1.001.010 d2erl__ 3 5 Protozoan pheromone proteins

SQL

SIMPLE Language for Building and Querying Tables

CREATE a table

INSERT values into it

SELECT various entries from it (tuples, rows)

UPDATE the values
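The four statements above can be tried end to end. What follows is only a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database; the SQLite column types and the Python variable names are assumptions, and the new value written by the UPDATE is arbitrary. The two rows are taken from the folds data shown in this lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE a table (column names follow the folds data; the types are assumed)
cur.execute("CREATE TABLE folds (fid TEXT PRIMARY KEY, bestrep TEXT, name TEXT)")

# INSERT values into it
cur.execute("INSERT INTO folds VALUES ('1.001.001', 'd1flp__', 'Globin-like')")
cur.execute("INSERT INTO folds VALUES ('1.001.003', 'd1ctj__', 'Cytochrome c')")

# SELECT entries (rows, tuples) from it
cur.execute("SELECT fid, name FROM folds WHERE bestrep = 'd1flp__'")
selected = cur.fetchall()

# UPDATE a value (the new name is an arbitrary illustration), then read it back
cur.execute("UPDATE folds SET name = 'Globin' WHERE fid = '1.001.001'")
cur.execute("SELECT name FROM folds WHERE fid = '1.001.001'")
updated_name = cur.fetchone()[0]

print(selected)
print(updated_name)
```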

Example: How Many Globin Folds are there in E. coli versus Yeast?

matches table

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0988 106 458 d1aco_2 4.4e-14

HI0154 2 76 d1acp__ 1.2e-23

HI1633 2 432 d1adea_ 0

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI1358 218 410 d1amy_2 0.00037

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

matches(gid char255, # Genome_ID
TrgStrt int, # Start of Match in Gene
TrgStop int, # End of Match in Gene
did char255, # ID Matching Structure
score real # e-value of Match
)

matches table 2

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0988 106 458 d1aco_2 4.4e-14

HI0154 2 76 d1acp__ 1.2e-23

HI1633 2 432 d1adea_ 0

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI1358 218 410 d1amy_2 0.00037

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

INSERT INTO matches (gid, TrgStrt, TrgStop, did, score) VALUES ('HI0299', 119, 135, 'd193l__', 3.1)
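Loading several rows of the matches table can be sketched as follows. This is a hedged illustration using Python's sqlite3 module, not the setup used in the lecture: mapping char255 to TEXT and real to REAL is an assumption, and the `?` placeholders are simply the parameter style that sqlite3 provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE matches (
           gid     TEXT,     -- genome ID
           TrgStrt INTEGER,  -- start of match in gene
           TrgStop INTEGER,  -- end of match in gene
           did     TEXT,     -- ID of matching structure
           score   REAL      -- e-value of match
       )"""
)

# First three rows of the matches table shown in the lecture
rows = [
    ("HI0299", 119, 135, "d193l__", 3.1),
    ("HI0572", 180, 240, "d1aba__", 0.0032),
    ("HI0989", 56, 125, "d1aco_1", 0.0049),
]

# One INSERT statement, executed once per row with bound parameters
conn.executemany(
    "INSERT INTO matches (gid, TrgStrt, TrgStop, did, score) VALUES (?, ?, ?, ?, ?)",
    rows,
)

n_rows = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
first = conn.execute("SELECT * FROM matches WHERE gid = 'HI0299'").fetchone()
print(n_rows, first)
```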

structures table

structures(did char255, # ID Matching Structure
fid char255 # ID of fold that structure has
)

d2rs51_ 1.002.007

d1imr__ 1.010.002

d1pyib1 1.007.030

d1dxtd_ 1.001.001

d181l__ 1.004.002

d1vmoa_ 1.002.044

d2gsq_1 1.001.031

d1etb2_ 1.002.003

d1guha1 1.001.031

d1hrc__ 1.001.003

d150lc_ 1.004.002

d1dmf__ 1.007.035

d1l19__ 1.004.002

d1yrnc_ 1.010.002

d1apld_ 1.001.004

d1ndab2 1.003.004

d2rmai_ 1.002.036

folds table

folds(fid char255,

# fold ID

bestrep char255,

N_hlx int,

N_beta int,

# number of helices & sheets

name char255

# name of fold

)

1.001.001 d1flp__ 8 0 Globin-like

1.001.002 d1hdj__ 4 0 Long alpha-hairpin

1.001.003 d1ctj__ 9 0 Cytochrome c

1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle

1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz

1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain

1.001.007 d2spca_ 0 2 Spectrin repeat unit

1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules

1.001.009 d1bal__ 0 5 Peripheral subunit-binding domain of 2-ox

1.001.010 d2erl__ 3 5 Protozoan pheromone proteins

Table Interpretation

Structure of a Table

Row
- Entity, Tuple, Instance

Column
- Field
- Attribute of an Entity
- dimension

Key
- Certain Attributes (or combination of attributes) can uniquely identify an object, these are keys

NULL
- Variant Records

What is a Key?

table matches(gid, TrgStrt, TrgStop, did, score)
- gid -> many matches
- gid,TrgStrt -> unique match (one tuple)
- thus, primary key gid,TrgStrt
- gid,TrgStop -> unique match as well

table structures(did, fid)
- fid -> many did’s, but did -> one fid
- thus, primary key did

table folds(fid, bestrep, N_hlx, N_beta, name)
- one-to-one between fid and name
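A database can enforce these keys for you. The sketch below assumes SQLite through Python's sqlite3 module and declares the primary keys named above; the third INSERT is a made-up row whose only purpose is to repeat an existing (gid, TrgStrt) pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER, "
    "did TEXT, score REAL, PRIMARY KEY (gid, TrgStrt))"
)
conn.execute("CREATE TABLE structures (did TEXT PRIMARY KEY, fid TEXT)")

# gid alone is not a key: HI1358 has two matches, told apart by TrgStrt
conn.execute("INSERT INTO matches VALUES ('HI1358', 239, 444, 'd1amg_2', 0.002)")
conn.execute("INSERT INTO matches VALUES ('HI1358', 218, 410, 'd1amy_2', 0.00037)")

try:
    # Hypothetical row that reuses the key (HI1358, 239); it should be refused
    conn.execute("INSERT INTO matches VALUES ('HI1358', 239, 500, 'd1ans__', 1.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_hi1358 = conn.execute(
    "SELECT COUNT(*) FROM matches WHERE gid = 'HI1358'"
).fetchone()[0]
print(duplicate_rejected, n_hi1358)
```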

SQL Select on a Single Table

Select {columns} from {a table} where {row-selection is true}

projection of a selection

Sort result on an attribute
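As a rough illustration of the pattern, the sketch below runs one query that selects rows, projects two columns and sorts on an attribute. It assumes SQLite via Python's sqlite3 module; the 0.01 threshold is chosen only for the example, and the five rows come from the matches table in this lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER, did TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?)",
    [
        ("HI0299", 119, 135, "d193l__", 3.1),
        ("HI0572", 180, 240, "d1aba__", 0.0032),
        ("HI0989", 56, 125, "d1aco_1", 0.0049),
        ("HI0349", 1, 183, "d1aky__", 7.6e-36),
        ("HI1309", 35, 52, "d1alo_3", 1.1),
    ],
)

# WHERE keeps rows (selection), the column list keeps columns (projection),
# and ORDER BY sorts the result on one attribute.
good_hits = conn.execute(
    "SELECT gid, did FROM matches WHERE score < 0.01 ORDER BY score"
).fetchall()
print(good_hits)
```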

SQL Select on a Single Table, Example

Select * from matches where gid = 'HI0016'

HI0016 1 173 d1dar_2 2e-07

HI0016 179 274 d1dar_1 8.5e-06

HI0016 399 476 d1dar_4 0.00031

Select * from matches where gid = 'HI0016' and TrgStrt = 179

HI0016 179 274 d1dar_1 8.5e-06

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI0016 1 173 d1dar_2 2e-07

HI0016 179 274 d1dar_1 8.5e-06

HI0016 399 476 d1dar_4 0.00031

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

SQL Select on a Single Table, Example 2

Select did from matches where score < 0.0001

d1aky__, d1dar_2, d1dar_1

HI0349 1 183 d1aky__ 7.6e-36

HI0016 1 173 d1dar_2 2e-07

HI0016 179 274 d1dar_1 8.5e-06

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI0016 1 173 d1dar_2 2e-07

HI0016 179 274 d1dar_1 8.5e-06

HI0016 399 476 d1dar_4 0.00031

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

Joins

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0988 106 458 d1aco_2 4.4e-14

HI0154 2 76 d1acp__ 1.2e-23

HI1633 2 432 d1adea_ 0

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI1358 218 410 d1amy_2 0.00037

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

d2rs51_ 1.002.007

d1imr__ 1.010.002

d1pyib1 1.007.030

d1dxtd_ 1.001.001

d181l__ 1.004.002

d1vmoa_ 1.002.044

d2gsq_1 1.001.031

d1etb2_ 1.002.003

d1guha1 1.001.031

d1hrc__ 1.001.003

d150lc_ 1.004.002

d1dmf__ 1.007.035

d1l19__ 1.004.002

d1yrnc_ 1.010.002

d1ans__ 1.007.008

d2rmai_ 1.002.036

1.001.001 d1flp__ 8 0 Globin-like

1.001.002 d1hdj__ 4 0 Long alpha-hairpin

1.001.003 d1ctj__ 9 0 Cytochrome c

1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle

1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz

1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain

1.001.007 d2spca_ 0 2 Spectrin repeat unit

1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules

1.007.008 d1qkt__ 4 3 Neurotoxin III (ATX III)

1.001.010 d2erl__ 3 5 Protozoan pheromone proteins

SQL Select on Multiple Tables

Select * from matches, structures, folds
where matches.gid = 'HI0361'
and matches.did = structures.did
and structures.fid = folds.fid

Returns
matches | structures | folds
HI0361,285,295,d1ans__,8.2 | d1ans__,1.007.008 | 1.007.008,d1qkt__,4,3,Neurotoxin III ...

Select score, name from matches, structures, folds
where gid = 'HI0361'
and matches.did = structures.did
and structures.fid = folds.fid

Returns
8.2, Neurotoxin III ...
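The three-table select can be reproduced on a small subset of the data. This is a sketch under assumptions: SQLite through Python's sqlite3 module, TEXT/INTEGER/REAL column types, and only two rows per table (all taken from the tables in this lecture) so that the result is easy to check by eye.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER, did TEXT, score REAL);
    CREATE TABLE structures (did TEXT, fid TEXT);
    CREATE TABLE folds (fid TEXT, bestrep TEXT, N_hlx INTEGER, N_beta INTEGER, name TEXT);

    INSERT INTO matches VALUES ('HI0361', 285, 295, 'd1ans__', 8.2);
    INSERT INTO matches VALUES ('HI0299', 119, 135, 'd193l__', 3.1);
    INSERT INTO structures VALUES ('d1ans__', '1.007.008');
    INSERT INTO structures VALUES ('d1dxtd_', '1.001.001');
    INSERT INTO folds VALUES ('1.007.008', 'd1qkt__', 4, 3, 'Neurotoxin III (ATX III)');
    INSERT INTO folds VALUES ('1.001.001', 'd1flp__', 8, 0, 'Globin-like');
    """
)

# The did column links matches to structures; the fid column links structures to folds
query = """
    SELECT score, name
    FROM matches, structures, folds
    WHERE matches.gid = 'HI0361'
      AND matches.did = structures.did
      AND structures.fid = folds.fid
"""
result = conn.execute(query).fetchall()
print(result)
```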

Foreign Key

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0988 106 458 d1aco_2 4.4e-14

HI0154 2 76 d1acp__ 1.2e-23

HI1633 2 432 d1adea_ 0

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI1358 218 410 d1amy_2 0.00037

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

d2rs51_ 1.002.007

d1imr__ 1.010.002

d1pyib1 1.007.030

d1dxtd_ 1.001.001

d181l__ 1.004.002

d1vmoa_ 1.002.044

d2gsq_1 1.001.031

d1etb2_ 1.002.003

d1guha1 1.001.031

d1hrc__ 1.001.003

d150lc_ 1.004.002

d1dmf__ 1.007.035

d1l19__ 1.004.002

d1yrnc_ 1.010.002

d1ans__ 1.007.008

d2rmai_ 1.002.036

Selection as Array Lookup

Same for a fold identifier from a structure id
- $fid = $structures{$did}
- (perl pseudo-code)

Same for matches and folds tables, but this time arrays return multiple values and have multiple field keys
- ($bestrep, $N_hlx, $N_beta, $name) = $folds{$fid}
- ($TrgStop, $did, $score) = $matches{$gid,$TrgStrt}

Joining as a double-lookup
- $did = 1mbd__
- ($bestrep, $N_hlx, $N_beta, $name) = $folds{ $structures{$did} }
- Select bestrep,N_hlx,N_beta,name from structures, folds where structures.fid = folds.fid and structures.did = 1mbd__
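The same lookups can be written as runnable code. The sketch below is a loose Python translation of the Perl pseudo-code, with dictionaries standing in for Perl hashes; the dictionary names mirror the table names, the Python variable names are assumptions, and the entries are a few rows from the tables in this lecture (d1ans__ is used instead of 1mbd__ because it appears in the sample data).

```python
# structures: did -> fid (single-field key, single value)
structures = {"d1ans__": "1.007.008", "d1dxtd_": "1.001.001"}

# folds: fid -> (bestrep, N_hlx, N_beta, name) (single-field key, several values)
folds = {
    "1.007.008": ("d1qkt__", 4, 3, "Neurotoxin III (ATX III)"),
    "1.001.001": ("d1flp__", 8, 0, "Globin-like"),
}

# matches: (gid, TrgStrt) -> (TrgStop, did, score) (two-field key)
matches = {("HI0361", 285): (295, "d1ans__", 8.2)}

# Selection as a lookup
fid = structures["d1ans__"]
bestrep, n_hlx, n_beta, name = folds[fid]
trg_stop, did, score = matches[("HI0361", 285)]

# Joining as a double lookup: structure id -> fold id -> fold record
joined = folds[structures[did]]
print(fid, name, joined)
```

A dictionary lookup only works in the direction of the key, whereas an SQL select can filter on any column, which is one reason the table view is more general.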

SQL Select on Multiple Tables

Select {columns} from {huge cross-product of tables} where {row-selection is true}
- cross-product T(1) x T(2) builds a huge virtual table where every row of T(1) is paired with every row of T(2). Then perform selection on this.

Select fid from matches,structures where gid=HI009 and matches.did = structures.did

Cross Product A x B

A has N rows and C columns

B has M rows and K columns

B(1) = Row 1 of Table B
B(2) = Row 2 of Table B
B(i) = Row i of Table B

A x B has N x M rows and C+K columns
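The row and column counts can be checked on a toy case. The sketch below builds the cross product explicitly with Python's itertools.product and then applies a selection to it; it is meant only to illustrate the concept (a real database engine would normally avoid materializing the full product), and the gene HI0361 is used because it appears in the sample rows.

```python
from itertools import product

# Table A: two rows of matches, so N = 2 rows and C = 5 columns
matches = [
    ("HI0361", 285, 295, "d1ans__", 8.2),
    ("HI0299", 119, 135, "d193l__", 3.1),
]
# Table B: three rows of structures, so M = 3 rows and K = 2 columns
structures = [
    ("d1ans__", "1.007.008"),
    ("d1dxtd_", "1.001.001"),
    ("d2rmai_", "1.002.036"),
]

# Cross product: every row of A concatenated with every row of B
cross = [a + b for a, b in product(matches, structures)]
n_rows = len(cross)     # N x M
n_cols = len(cross[0])  # C + K

# Selection on the virtual table: gid = HI0361 and matches.did = structures.did
# (column 0 is gid, column 3 is matches.did, column 5 is structures.did, column 6 is fid)
fids = [row[6] for row in cross if row[0] == "HI0361" and row[3] == row[5]]
print(n_rows, n_cols, fids)
```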

ER-diagrams

Korth & Silberschatz
- branch <=> matches (gid-start +++ did)
- customer <=> folds (fid +++)
- linked by account <=> structures (did fid)

Aggregate Functions--Statistics on Attributes

Query Statistics
- select gid, count(distinct did) from matches group by gid
- select max(N_hlx) from folds where N_beta = 0
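Both statistics can be run on a handful of rows. The following is a sketch assuming SQLite via Python's sqlite3 module; the GROUP BY clause is included so that one count is reported per gene, and the ORDER BY is there only to make the output order predictable. All rows are taken from the matches and folds tables in this lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER, did TEXT, score REAL);
    CREATE TABLE folds (fid TEXT, bestrep TEXT, N_hlx INTEGER, N_beta INTEGER, name TEXT);

    INSERT INTO matches VALUES ('HI1358', 239, 444, 'd1amg_2', 0.002);
    INSERT INTO matches VALUES ('HI1358', 218, 410, 'd1amy_2', 0.00037);
    INSERT INTO matches VALUES ('HI0460', 20, 24, 'd1ans__', 1.8);

    INSERT INTO folds VALUES ('1.001.001', 'd1flp__', 8, 0, 'Globin-like');
    INSERT INTO folds VALUES ('1.001.002', 'd1hdj__', 4, 0, 'Long alpha-hairpin');
    INSERT INTO folds VALUES ('1.001.003', 'd1ctj__', 9, 0, 'Cytochrome c');
    INSERT INTO folds VALUES ('1.001.010', 'd2erl__', 3, 5, 'Protozoan pheromone proteins');
    """
)

# Number of distinct matching structures per gene
per_gene = conn.execute(
    "SELECT gid, COUNT(DISTINCT did) FROM matches GROUP BY gid ORDER BY gid"
).fetchall()

# Largest helix count among folds with no beta structure
max_helices = conn.execute(
    "SELECT MAX(N_hlx) FROM folds WHERE N_beta = 0"
).fetchone()[0]
print(per_gene, max_helices)
```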

How many matches to globins in the E. coli genome?

Complex Query by nesting selections
- F <= select fid from folds where name contains “globin”
- D <= select did from structures where fid in F
- N <= select count(distinct gid,TrgStrt) from matches where did in D and score < .01
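The three steps F, D and N can be folded into one nested query. The sketch below assumes SQLite via Python's sqlite3 module and makes a few substitutions that should be read as assumptions: "contains" is written as LIKE '%globin%' (case-insensitive for ASCII in SQLite), and the count over distinct (gid, TrgStrt) pairs is written as a COUNT(*) over a SELECT DISTINCT, since SQLite's COUNT(DISTINCT ...) takes a single argument. The structures and folds rows come from this lecture; the three matches rows and their gene names are hypothetical, invented only so the query has something to count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE matches (gid TEXT, TrgStrt INTEGER, TrgStop INTEGER, did TEXT, score REAL);
    CREATE TABLE structures (did TEXT, fid TEXT);
    CREATE TABLE folds (fid TEXT, bestrep TEXT, N_hlx INTEGER, N_beta INTEGER, name TEXT);

    INSERT INTO folds VALUES ('1.001.001', 'd1flp__', 8, 0, 'Globin-like');
    INSERT INTO folds VALUES ('1.001.003', 'd1ctj__', 9, 0, 'Cytochrome c');
    INSERT INTO structures VALUES ('d1dxtd_', '1.001.001');
    INSERT INTO structures VALUES ('d1hrc__', '1.001.003');

    -- Hypothetical matches (not from the lecture data)
    INSERT INTO matches VALUES ('geneA', 1, 140, 'd1dxtd_', 1e-20);
    INSERT INTO matches VALUES ('geneB', 5, 150, 'd1dxtd_', 0.5);
    INSERT INTO matches VALUES ('geneC', 10, 100, 'd1hrc__', 1e-10);
    """
)

query = """
    SELECT COUNT(*) FROM (
        SELECT DISTINCT gid, TrgStrt
        FROM matches
        WHERE did IN (                      -- D: structures with a globin fold
                  SELECT did FROM structures
                  WHERE fid IN (            -- F: folds whose name contains 'globin'
                      SELECT fid FROM folds WHERE name LIKE '%globin%'
                  )
              )
          AND score < 0.01
    )
"""
n_globin_matches = conn.execute(query).fetchone()[0]
print(n_globin_matches)
```

Only geneA is counted: geneB matches a globin structure but fails the score cutoff, and geneC passes the cutoff but matches a cytochrome c structure.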

Joins

HI0299 119 135 d193l__ 3.1

HI0572 180 240 d1aba__ 0.0032

HI0989 56 125 d1aco_1 0.0049

HI0988 106 458 d1aco_2 4.4e-14

HI0154 2 76 d1acp__ 1.2e-23

HI1633 2 432 d1adea_ 0

HI0349 1 183 d1aky__ 7.6e-36

HI1309 35 52 d1alo_3 1.1

HI0589 8 25 d1alo_3 1.8

HI1358 239 444 d1amg_2 0.002

HI1358 218 410 d1amy_2 0.00037

HI0460 20 24 d1ans__ 1.8

HI1386 139 147 d1ans__ 3.3

HI0421 11 14 d1ans__ 6.4

HI0361 285 295 d1ans__ 8.2

HI0835 100 106 d1ans__ 9.7

d2rs51_ 1.002.007

d1imr__ 1.010.002

d1pyib1 1.007.030

d1dxtd_ 1.001.001

d181l__ 1.004.002

d1vmoa_ 1.002.044

d2gsq_1 1.001.031

d1etb2_ 1.002.003

d1guha1 1.001.031

d1hrc__ 1.001.003

d150lc_ 1.004.002

d1dmf__ 1.007.035

d1l19__ 1.004.002

d1yrnc_ 1.010.002

d1ans__ 1.007.008

d2rmai_ 1.002.036

1.001.001 d1flp__ 8 0 Globin-like

1.001.002 d1hdj__ 4 0 Long alpha-hairpin

1.001.003 d1ctj__ 9 0 Cytochrome c

1.001.004 d1enh__ 2 0 DNA-binding 3-helical bundle

1.001.005 d1dtr_2 1 3 Diphtheria toxin repressor (DtxR) dimeriz

1.001.006 d1tns__ 1 2 Mu transposase, DNA-binding domain

1.001.007 d2spca_ 0 2 Spectrin repeat unit

1.001.008 d1bdd__ 0 4 Immunoglobulin-binding protein A modules

1.007.008 d1qkt__ 4 3 Neurotoxin III (ATX III)

1.001.010 d2erl__ 3 5 Protozoan pheromone proteins

Join Gives Unnormalized Table

HI0299 119 135 d193l__ 3.1 1.010.002 0 2 Spectrin repeat unit

HI0572 180 240 d1aba__ 0.0032 1.002.045 1 2 Mu transposase, DNA-binding domain

HI0989 56 125 d1aco_1 0.0049 1.001.031 8 0 Globin-like

HI0988 106 458 d1aco_2 4.4e-14 1.001.031 8 0 Globin-like

HI0154 2 76 d1acp__ 1.2e-23 1.001.031 8 0 Globin-like

HI1633 2 432 d1adea_ 0 1.010.002 0 2 Spectrin repeat unit

HI0349 1 183 d1aky__ 7.6e-36 1.001.031 8 0 Globin-like

HI1309 35 52 d1alo_3 1.1 1.007.008 4 3 Neurotoxin III (ATX III)

HI0589 8 25 d1alo_3 1.8 1.002.045 1 2 Mu transposase, DNA-binding domain

HI1358 239 444 d1amg_2 0.002 1.004.002 1 3 Diphtheria toxin repressor (DtxR)

HI1358 218 410 d1amy_2 0.00037 1.002.044 0 4 Immunoglobulin-binding protein A

HI0460 20 24 d1ans__ 1.8 1.007.008 4 3 Neurotoxin III (ATX III)

HI1386 139 147 d1ans__ 3.3 1.007.008 4 3 Neurotoxin III (ATX III)

HI0421 11 14 d1ans__ 6.4 1.007.008 4 3 Neurotoxin III (ATX III)

HI0361 285 295 d1ans__ 8.2 1.007.008 4 3 Neurotoxin III (ATX III)

HI0835 100 106 d1ans__ 9.7 1.007.008 4 3 Neurotoxin III (ATX III)

Normalization