Supplementary Material on Datamining

for

D Christendat, A Yee, A Dharamsi, Y Kluger, A Savchenko, J R Cort, V Booth, C D Mackereth, V Saridakis, I Ekiel, G Kozlov, K L Maxwell, N Wu, L P. McIntosh, K Gehring, M A. Kennedy, A R Davidson, E F Pai, M Gerstein, A M Edwards & C H Arrowsmith.

"Structural Proteomics of an Archeon," citation and paper


Datasources

Current SPINE Database / NESG website
Frozen Version of SPINE Database (used for Datamining here)

Crystal Tree

[   ] crystal.tree.pdf        30-May-2000 07:44     3k  
[   ] crystal.tree.ps         30-May-2000 07:44     4k  

Decision tree for crystallizability. The numbers E/T at each node
denote the proportion of the training cases reaching that node that
are wrongly classified by the node's label. The total number of
instances T at a given node is the number of correctly classified
instances C plus the number of incorrectly classified instances (the
error E), so that T = C + E.

YES = "protein could be crystallized"
NO  = "protein could NOT be crystallized"

At the top, we have 63 cases =  24 YES + 39 NO.

At the next level on the left, we have 44 cases, 23 YES, 21 NO. 
At the next level on the right, we have 19 cases, 1 YES, 18 NO. 

At the third level, the leftmost two nodes are:
 7 cases =           7 NO
37 cases = 23 YES + 14 NO
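
As a quick check of the E/T notation, the short Python sketch below
recomputes E and T from the node counts listed above. It assumes each
node is labeled with its majority class, which the text does not state
explicitly; the names `nodes` and `node_error` are hypothetical and used
only for illustration.

```python
# (YES, NO) counts of training cases reaching each node, as listed above.
nodes = {
    "root": (24, 39),
    "level 2, left": (23, 21),
    "level 2, right": (1, 18),
    "level 3, leftmost": (0, 7),
    "level 3, second": (23, 14),
}

def node_error(yes, no):
    """Return (label, E, T) for a node.

    Assumption: the node carries the label of its majority class, so the
    minority count is the error E. Ties are labeled NO here (arbitrary).
    """
    total = yes + no                      # T = C + E
    label = "YES" if yes > no else "NO"
    errors = min(yes, no)                 # E: cases contradicting the label
    return label, errors, total

for name, (yes, no) in nodes.items():
    label, e, t = node_error(yes, no)
    print(f"{name}: label={label}  E/T={e}/{t}  ({e / t:.2f})")

# The two daughters of a node must account for all of its cases.
assert nodes["level 2, left"][0] + nodes["level 2, right"][0] == nodes["root"][0]
assert nodes["level 2, left"][1] + nodes["level 2, right"][1] == nodes["root"][1]
```

Under that assumption the root would read 24/63 and the right-hand
second-level node 1/19.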


Expression Tree

[   ] expression.tree.pdf     30-May-2000 07:44    13k
[   ] expression.tree.ps      30-May-2000 07:44    23k

Solubility Tree

[IMG] solubility.GIF 30-May-2000 17:36 62k
(Figure 3 from the paper.) A decision tree for discriminating between soluble and insoluble proteins. The nodes of the tree are represented by ellipses (intermediate nodes) and rectangles (final nodes, or leaves). The number on the left of each node is the number of insoluble proteins in the node and is proportional to the node's dark area. Similarly, the number on the right is the number of soluble proteins and is proportional to the white area.

At each intermediate node, the decision tree algorithm calculates all possible splitting thresholds for each of 53 variables (hydrophobicity, amino acid composition, etc.). It picks the splitting variable and threshold that make at least one of the two daughter nodes as homogeneous as possible. When a variable v is split, v < threshold is the left branch and v > threshold is the right branch.

The specific variables used at each node and their thresholds for the right branches shown in the graph are, in descending order (from the root at the top to the leaves at the bottom): hydrophobe > 0.85 kcal/mole (where "hydrophobe" represents the average GES hydrophobicity of a sequence stretch, as discussed in the text; the higher this value, the lower the energy transfer); cplx > 0.28 (a measure of a short complexity region based on the SEG program); Gln composition > 4%; Asp+Glu composition > 17%; Ile composition > 5.6%; Phe+Tyr+Trp composition > 7.5%; Asp+Glu composition > 13.6%; Gly+Ala+Val+Leu+Ile composition > 42%; hphobe > 0.01 kcal/mole; His+Lys+Arg composition > 12%; Trp composition > 1.2%; and alpha-helical secondary structure composition > 58%. Note that two of the variables are conditioned on more than once (hphobe, Asp+Glu).

The highlighted decision pathways (in color, discussed in the NSB text) terminate in highly homogeneous nodes (mostly dark is insoluble, mostly white is soluble).
The shorter the decision pathway and the larger the number of cases in its terminal node, the less likely the pathway is to over-fit the data, and therefore the more suitable it is for future predictions. Heterogeneous leaves could be split further (dotted lines), improving the error rate but risking over-fitting of the training set.
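
The threshold search described in the caption can be illustrated with a minimal Python sketch. It is not the algorithm used for the figure: it assumes that "as homogeneous as possible" means maximizing the majority-class fraction of the purer daughter node, and it breaks ties by preferring the larger such node. The function names (`purity`, `best_split`), the `min_cases` parameter, and the toy data are all hypothetical.

```python
def purity(labels):
    """Fraction of the majority class in a node (1.0 = fully homogeneous)."""
    if not labels:
        return 0.0
    soluble = sum(labels)
    return max(soluble, len(labels) - soluble) / len(labels)

def best_split(rows, labels, min_cases=1):
    """Exhaustively search every variable and candidate threshold.

    rows   : list of equal-length feature tuples (one per protein)
    labels : 1 = soluble, 0 = insoluble
    Returns ((purity, size), variable_index, threshold) for the split whose
    purer daughter node scores highest, or None if no split is possible.
    Assumed criterion: purity of the purer daughter, ties broken by its size.
    """
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]  # v > threshold
            if len(left) < min_cases or len(right) < min_cases:
                continue
            score = max((purity(left), len(left)), (purity(right), len(right)))
            if best is None or score > best[0]:
                best = (score, j, t)
    return best

# Hypothetical toy data: two variables per protein, not taken from the paper.
rows = [(0.10, 3.0), (0.20, 5.0), (0.30, 4.0), (0.90, 4.5), (0.95, 3.5)]
labels = [1, 1, 1, 0, 0]
score, variable, threshold = best_split(rows, labels)
print(variable, round(threshold, 3), score)
```

Raising `min_cases` in this sketch discards splits that isolate only a handful of proteins, which mirrors the caption's point that terminal nodes with more cases are less prone to over-fitting.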