One would expect, and bioinformatics in fact operates on the premise,
that proteins with similar sequences carry out similar functions, while
proteins with different sequences carry out different functions. But there
are exceptions to these rules (Fig 1).
We addressed this issue by systematically examining the relationship
between protein function and structure. We focused on annotated enzymes
in Swissprot that are classified with an EC number in the ENZYME
database and related these to structurally classified proteins in the SCOP
database. The enzymatic functions are classified into 207 categories, while
the SCOP database contains altogether 361 different folds (Table 1).
In total, there are 229 folds and 92 functions in our analysis, giving
rise to a maximum of 21068 (= 229 x 92) possible fold-function combinations
(and a minimum of 229). We actually observe 331 of these combinations (1.6%)
(Fig 2).
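
The arithmetic behind these bounds can be checked with a short sketch. The
variable names are illustrative only, and the minimum is computed under the
assumption that every fold and every function appears in at least one
combination.

```python
# Counts quoted in the text; variable names are illustrative, not the authors'.
n_folds = 229
n_functions = 92
n_observed = 331

# Upper bound: every fold paired with every function.
max_combinations = n_folds * n_functions

# Lower bound, assuming each fold and each function occurs at least once:
# the larger of the two counts.
min_combinations = max(n_folds, n_functions)

# Share of the possible combinations that is actually observed.
observed_fraction = n_observed / max_combinations

print(f"maximum: {max_combinations}")
print(f"minimum: {min_combinations}")
print(f"observed: {n_observed} ({observed_fraction:.1%})")
```

The observed share comes out at about 1.6% of the maximum, matching the
figure quoted above.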
We found that certain broad classes of folds and functions are preferentially
associated with one another. For instance, enzymes, especially transferases
and hydrolases, are disproportionately associated with alpha/beta folds,
and non-enzymes with all-alpha folds (Table 2). These tendencies, observed
for the database overall, largely hold for proteins in specific genomes.
In particular, we found little difference between the distribution of
fold-function combinations in yeast and E. coli as compared to those in
Swissprot (Fig 3).
We identified both the most versatile functions (Fig 5), i.e. those
associated with the most folds, and the most versatile folds (Fig 4),
i.e. those associated with the most functions. The two most versatile enzymatic
functions are the hydro-lyases and the O-glycosyl glucosidases, which are
associated with 7 folds each. In similar fashion, we found that the five
most versatile folds are all mixed alpha-beta folds: the TIM-barrel, the
alpha-beta hydrolase, the P-loop containing NTP hydrolase, the Rossmann
fold and the Ferredoxin fold. These folds, especially the TIM-barrel, stand
out from the other folds as generic scaffolds, accommodating from 5
to as many as 15 enzymatic functions as well as non-enzymatic ones. More
broadly, our analysis reveals that nearly half of the folds with enzymatic
activity have at least two different functions (53 out of 128) and that
an even higher proportion of enzymatic functions (51 out of 91) are carried
out by unrelated folds. We also calculated the proportion of multifunctional
domains, i.e. the fraction of domains that are homologous to proteins
with different functions (Fig 6).
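
The versatility counts described above can be sketched as a simple tally
over fold-function pairs. This is a minimal illustration, not the authors'
pipeline: the pairs below are hypothetical toy data, and the function names
`versatility` and `fraction_with_multiple` are invented for this example.

```python
from collections import defaultdict

# Hypothetical toy data: (fold, function) pairs. NOT taken from the paper.
pairs = [
    ("TIM-barrel", "hydro-lyase"),
    ("TIM-barrel", "glycosidase"),
    ("TIM-barrel", "isomerase"),
    ("Rossmann fold", "oxidoreductase"),
    ("Rossmann fold", "hydro-lyase"),
    ("Ferredoxin fold", "oxidoreductase"),
    ("Globin", "non-enzyme"),
]


def versatility(pairs):
    """Map each fold to its set of functions and each function to its folds."""
    functions_per_fold = defaultdict(set)
    folds_per_function = defaultdict(set)
    for fold, function in pairs:
        functions_per_fold[fold].add(function)
        folds_per_function[function].add(fold)
    return functions_per_fold, folds_per_function


def fraction_with_multiple(mapping):
    """Return (entries with at least two partners, total entries)."""
    multi = sum(1 for partners in mapping.values() if len(partners) >= 2)
    return multi, len(mapping)


functions_per_fold, folds_per_function = versatility(pairs)

# The most versatile fold is the one associated with the most functions.
most_versatile_fold = max(
    functions_per_fold, key=lambda fold: len(functions_per_fold[fold])
)

multi_folds, total_folds = fraction_with_multiple(functions_per_fold)
multi_funcs, total_funcs = fraction_with_multiple(folds_per_function)

print("most versatile fold:", most_versatile_fold)
print(f"folds with >= 2 functions: {multi_folds} out of {total_folds}")
print(f"functions with >= 2 folds: {multi_funcs} out of {total_funcs}")
```

Applied to the real fold-function table, the same tally would yield counts
of the kind reported in the text (for example, 53 out of 128 folds with at
least two functions).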
These observations for the database overall are largely true for specific
genomes. We focused, in particular, on yeast, analyzing it with several
functional (EC, COGs, MIPS) and structural (SCOP and CATH) classification
schemes. We found clear tendencies
for fold-function association, across a broad spectrum of functions. Analysis
with the COGs scheme also suggests that the functions of the most ancient
proteins are more evenly distributed among different structural classes
than those of more modern ones.
© 1998 Hedi Hegyi & Mark Gerstein
© 1998 Graphics - Jimmy Lin