Summary of New Masking Methods
In the revised TM identification, we have been using a cutoff of
-2 kcal/mole on the GES scale to discriminate between membrane
and soluble proteins. We also partition low-complexity regions
into two groups. Both changes are described in this document.
The Current Masking Process
Currently, the annotation process is divided into a number
of parts. (The specific values quoted are for MG, but similar
numbers apply to SC and EC.)
- PDB. These are residues in the known structures.
In MG, using PSI-BLAST, the fraction of amino acids masked has crept up
to over 30%. These amino acids involve 296 distinct matches (segments)
in 208 of the 468 MG sequences (44%). On average each match is
about 175 a.a. in length.
- TMB. The surest ("best") TM segments were masked
next. These were segments of at least 20 residues
with an average GES hydrophobicity better (that is, more negative)
than -1 kcal/mole in a protein that had at least one 20-residue
segment with an average GES hydrophobicity better than -2 kcal/mole.
(This is the Boyd and Beckwith MaxH approach to membrane protein
identification.)
In MG, only about 8% of the residues are flagged as sure TM segments,
but note that these occur in 18% of the sequences. Almost all
of these sequences are proteins with two or more TM helices.
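The TMB criterion above can be sketched as a sliding-window scan. This is a minimal illustration, not the code used for the annotation: the function names are hypothetical, the GES values are the commonly tabulated ones (negative means hydrophobic) and should be checked against the table actually used, "better than" is assumed to mean "less than or equal to" the cutoff, and unknown residues are assumed to contribute 0.

```python
# GES transfer free energies in kcal/mole, as commonly tabulated
# (assumption: verify against the table used in the real pipeline).
GES = {
    "F": -3.7, "M": -3.4, "I": -3.1, "L": -2.8, "V": -2.6,
    "C": -2.0, "W": -1.9, "A": -1.6, "T": -1.2, "G": -1.0,
    "S": -0.6, "P": 0.2, "Y": 0.7, "H": 3.0, "Q": 4.1,
    "N": 4.8, "E": 8.2, "K": 8.8, "D": 9.2, "R": 12.3,
}


def window_averages(seq, width=20):
    """Average GES value of every full window of `width` residues."""
    # Assumption: residues missing from the table (e.g. X) count as 0.0.
    values = [GES.get(aa, 0.0) for aa in seq]
    return [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]


def classify_tm(seq, width=20, segment_cutoff=-1.0, protein_cutoff=-2.0):
    """Return start positions of windows flagged as sure TM segments.

    A window is flagged only if its average is at or below
    `segment_cutoff` AND the protein contains at least one window
    at or below `protein_cutoff` (the MaxH-style protein-level test).
    """
    averages = window_averages(seq, width)
    if not averages or min(averages) > protein_cutoff:
        return []  # protein fails the -2 kcal/mole test: nothing flagged
    return [i for i, avg in enumerate(averages) if avg <= segment_cutoff]
```

For example, a 20-residue run of alanine (average -1.6) passes the -1 segment cutoff but is flagged only when the same protein also contains a window at or below -2, such as a run of leucine. Overlapping flagged windows would still need to be merged into segments, which this sketch leaves out.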
- SIG. Signal sequences were masked. These have
the pattern of a charged residue within the first seven residues,
followed by a stretch of 14 residues with an average hydrophobicity
under the cutoff.
In MG, the total fraction of residues masked by PDB, TMB, and
SIG is 38.5% (30.2% + 7.7% + 0.6%). These represent the surest
part of the annotation.
- LCV. Low-complexity regions were annotated
next. Stretches of low-complexity sequence are thought not to
fold into globular protein structures. They may correspond to
fibrous or disordered structures. Consequently, it is doubtful
whether they will ever be crystallized. Low-complexity regions
were identified with the SEG program using the standard parameters,
a trigger complexity K(1) of 3.4, an extension complexity K(2)
of 3.75, and a window of length 45. These parameters are the
ones used to find "long" domain-size low-complexity
regions. Quite a few of these were found. This means either that
there are many low-complexity regions in the MG genome
or that the SEG program is finding many false positives. We first
mask very long low-complexity regions -- LCVs.
- LNK. Segments of sequence already accounted
for thus far -- i.e. PDB matches, low-complexity regions, or transmembrane helices
-- are considered to be "characterized" regions. The
average length of these regions is ~140 residues, and these segments
make up ~58% of the total amino acids in a genome. Short sequences
between characterized segments are considered to be linkers,
loops or coils connecting known structural elements, whether
membrane-spanning helices or known globular domains. Across all
the genomes, linker regions average about 16 residues in length and
constitute ~5% of the total amino acids.
- TMS. The remaining TM segments were then
annotated. These have only a single 20-residue segment better
than the cutoff.
- LCM. The remaining low-complexity segments
were then annotated.
- LN2. Linkers were found again.
- UCD. After the whole masking process is done,
including finding the linkers, one is left with regions of sequence
that have not been characterized at all in a structural sense.
These "uncharacterized regions" presumably fold into
soluble, globular protein structures. However, some of them could
also be part of all-beta membrane proteins, such as porins. They
provide the most suitable comparison with the PDB, which also consists
(mostly) of soluble proteins with globular structures. Uncharacterized
regions currently constitute about 36.5% of the amino acids
in the MG genome.
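The linker and uncharacterized-region steps can be pictured as simple interval bookkeeping on each sequence. The sketch below is only an illustration under stated assumptions: the function names and the `max_linker` threshold are hypothetical (the text does not state how short a gap must be to count as a linker), intervals are half-open (start, end) residue positions, and gaps at the ends of a sequence are assumed never to be linkers because they are not flanked on both sides by characterized segments.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_gaps(characterized, seq_len, max_linker=30):
    """Split unmasked gaps into (linkers, uncharacterized) lists.

    `max_linker` is a placeholder value, not taken from the text.
    A gap is a linker only if it lies between two characterized
    segments and is at most `max_linker` residues long.
    """
    linkers, uncharacterized = [], []
    prev_end = 0
    for index, (start, end) in enumerate(merge_intervals(characterized)):
        if start > prev_end:
            gap = (prev_end, start)
            is_internal = index > 0  # flanked on both sides
            if is_internal and start - prev_end <= max_linker:
                linkers.append(gap)
            else:
                uncharacterized.append(gap)
        prev_end = end
    if prev_end < seq_len:
        # Trailing gap after the last characterized segment.
        uncharacterized.append((prev_end, seq_len))
    return linkers, uncharacterized
```

Run once after the first round of masking, this would give the LNK regions; run again after TMS and LCM are added to the characterized set, it would give LN2 and leave the UCD regions as the remaining gaps.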