The relationship between protein structure and function: a comprehensive survey focusing on enzymes

One would expect, and bioinformatics in fact operates on the premise, that proteins with similar sequences carry out similar functions while proteins with different sequences carry out different functions. But there are exceptions to these rules (Fig 1).

We addressed this issue by systematically examining the relationship between protein function and structure. We focused on enzymes annotated in Swissprot and classified with an EC number in the ENZYME database, and related these to structurally classified proteins in the SCOP database. The enzymatic functions are classified into 207 categories, while the SCOP database contains 361 different folds in total (Table 1).

In total, there are 229 folds and 92 functions in our analysis, giving rise to a maximum of 21068 (= 229 x 92) possible fold-function combinations (and a minimum of 229). We actually observe 331 of these combinations (1.6%) (Fig 2:PDF).
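The arithmetic behind these bounds can be checked with a short sketch. The counts are taken from the paragraph above; the variable names are illustrative, and the lower bound assumes that every fold and every function appears in at least one combination, which is our reading of the stated minimum of 229.

```python
# Counts quoted in the text above.
n_folds = 229
n_functions = 92
n_observed = 331

# Upper bound: every fold paired with every function.
max_combinations = n_folds * n_functions

# Lower bound (assumption: each fold and each function occurs at least once),
# so the larger of the two sets fixes the minimum number of pairs.
min_combinations = max(n_folds, n_functions)

# Share of the theoretically possible combinations that is actually observed.
observed_fraction = n_observed / max_combinations

print(f"maximum: {max_combinations}")
print(f"minimum: {min_combinations}")
print(f"observed: {n_observed} ({observed_fraction:.1%})")
```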

We found that certain broad classes of folds and functions are preferentially associated with one another. For instance, enzymes, especially transferases and hydrolases, are disproportionately associated with alpha/beta folds, and non-enzymes with all-alpha folds (Table 2). These tendencies in the database as a whole largely hold for proteins in specific genomes. In particular, we found little difference between the distributions of fold-function combinations in yeast and E. coli and the distribution in Swissprot (Fig 3).
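One common way to quantify such a preferential association is the ratio of observed to expected counts in a contingency table of fold class against function class. The sketch below illustrates that idea only; it is not necessarily the measure used in this survey. The function name `enrichment` and the counts in `toy_counts` are hypothetical and do not come from Table 2.

```python
def enrichment(counts):
    """Return observed/expected ratios for a fold-class x function-class table.

    `counts` maps (fold_class, function_class) -> number of combinations.
    The expected count assumes fold class and function class are independent.
    """
    total = sum(counts.values())
    row_totals = {}
    col_totals = {}
    for (fold_class, function_class), n in counts.items():
        row_totals[fold_class] = row_totals.get(fold_class, 0) + n
        col_totals[function_class] = col_totals.get(function_class, 0) + n

    ratios = {}
    for (fold_class, function_class), n in counts.items():
        expected = row_totals[fold_class] * col_totals[function_class] / total
        # A ratio above 1 means the pair occurs more often than expected.
        ratios[(fold_class, function_class)] = n / expected if expected > 0 else float("nan")
    return ratios


# Hypothetical toy counts, chosen only to make the arithmetic easy to follow.
toy_counts = {
    ("alpha/beta", "enzyme"): 30,
    ("alpha/beta", "non-enzyme"): 10,
    ("all-alpha", "enzyme"): 10,
    ("all-alpha", "non-enzyme"): 30,
}

ratios = enrichment(toy_counts)
for pair, ratio in sorted(ratios.items()):
    print(pair, ratio)
```

In this toy table each cell has an expected count of 20, so the alpha/beta enzyme cell is over-represented and the alpha/beta non-enzyme cell is under-represented.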

We identified both the most versatile functions, i.e. those associated with the most folds (Fig 5:PDF), and the most versatile folds, i.e. those associated with the most functions (Fig 4). The two most versatile enzymatic functions are the hydro-lyases and the O-glycosyl glucosidases, each associated with 7 folds. Similarly, the five most versatile folds are all mixed alpha-beta folds: the TIM-barrel, the alpha-beta hydrolase, the P-loop containing NTP hydrolase, the Rossmann fold and the Ferredoxin fold. These folds, especially the TIM-barrel, stand out from the others as generic scaffolds, accommodating from 5 to as many as 15 enzymatic functions as well as non-enzymatic ones. More broadly, our analysis reveals that over 40% of the folds with enzymatic activity have at least two different functions (53 out of 128) and that an even higher proportion of enzymatic functions (51 out of 91) are carried out by unrelated folds. We also calculated the proportion of multifunctional domains, i.e. the fraction of domains that are homologous to proteins with different functions (Fig 6).
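Versatility as defined above reduces to counting distinct partners in a list of fold-function pairs. The following is a minimal sketch of that counting, assuming the combinations are available as `(fold, function)` tuples. The helper names `versatility` and `rank_by_versatility` and the entries in `toy_pairs` are hypothetical and are not the survey's data.

```python
from collections import defaultdict


def versatility(pairs):
    """Group (fold, function) pairs into distinct partners per fold and per function."""
    functions_per_fold = defaultdict(set)
    folds_per_function = defaultdict(set)
    for fold, function in pairs:
        # Sets ensure that a repeated combination is counted only once.
        functions_per_fold[fold].add(function)
        folds_per_function[function].add(fold)
    return dict(functions_per_fold), dict(folds_per_function)


def rank_by_versatility(mapping):
    """Sort keys by number of distinct partners, most first; ties alphabetical."""
    counted = ((key, len(partners)) for key, partners in mapping.items())
    return sorted(counted, key=lambda item: (-item[1], item[0]))


# Hypothetical toy input; the last entry repeats an earlier combination.
toy_pairs = [
    ("TIM-barrel", "hydro-lyase"),
    ("TIM-barrel", "glycosidase"),
    ("TIM-barrel", "isomerase"),
    ("Rossmann fold", "oxidoreductase"),
    ("Rossmann fold", "hydro-lyase"),
    ("Ferredoxin fold", "hydro-lyase"),
    ("TIM-barrel", "hydro-lyase"),
]

functions_per_fold, folds_per_function = versatility(toy_pairs)
fold_ranking = rank_by_versatility(functions_per_fold)
function_ranking = rank_by_versatility(folds_per_function)

# Folds that carry at least two different functions.
multi_function_folds = sum(1 for _, n in fold_ranking if n >= 2)

print(fold_ranking)
print(function_ranking)
print(multi_function_folds)
```

The same two dictionaries give both summary statistics quoted in the text: the number of folds with at least two functions and the number of functions carried out by more than one fold.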

These observations for the database as a whole largely hold for specific genomes. We focused, in particular, on yeast, which we analyzed using several functional (EC, COGs, MIPS) and structural (SCOP and CATH) classification schemes. We found clear tendencies for fold-function association across a broad spectrum of functions. Analysis with the COGs scheme also suggests that the functions of the most ancient proteins are more evenly distributed among different structural classes than those of more modern ones.