Summary of New Masking Methods

In the revised TM identification, we have been using a cutoff of -2 kcal/mole on the GES scale to discriminate between membrane and soluble proteins. We also partition low-complexity regions into two groups. Both changes are described in this document.
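As a rough illustration of the cutoff test, the sketch below computes the most hydrophobic window average of a sequence and compares it with -2 kcal/mole. It is a minimal sketch under stated assumptions, not the code actually used here. The 20-residue window is taken from the TMB step described below. The function names, the treatment of unknown residues as 0, and the reading of "better than" as "less than or equal to" are all assumptions. The GES values are the commonly tabulated transfer free energies and should be checked against the original table before any real use.

```python
# GES hydrophobicity in kcal/mole (water -> membrane transfer free energy).
# More negative means more hydrophobic. Values as commonly tabulated;
# verify against the original table before relying on them.
GES = {
    "F": -3.7, "M": -3.4, "I": -3.1, "L": -2.8, "V": -2.6,
    "C": -2.0, "W": -1.9, "A": -1.6, "T": -1.2, "G": -1.0,
    "S": -0.6, "P": 0.2, "Y": 0.7, "H": 3.0, "Q": 4.1,
    "N": 4.8, "E": 8.2, "K": 8.8, "D": 9.2, "R": 12.3,
}


def window_averages(seq, window=20, scale=GES):
    """Return the average hydrophobicity of every full window along seq."""
    # Assumption: residues missing from the scale (e.g. X) contribute 0.
    values = [scale.get(aa, 0.0) for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    averages = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


def max_h(seq, window=20, scale=GES):
    """Most hydrophobic (most negative) window average, or None if seq is too short."""
    averages = window_averages(seq, window, scale)
    return min(averages) if averages else None


def is_membrane_protein(seq, cutoff=-2.0, window=20, scale=GES):
    """Flag a protein as membrane if its best window reaches the cutoff."""
    h = max_h(seq, window, scale)
    return h is not None and h <= cutoff
```

With this sign convention, a stretch of 20 leucines averages -2.8 and passes the cutoff, while a stretch of 20 alanines averages -1.6 and does not.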

The Current Masking Process

Currently, the annotation process is divided into a number of steps. (The specific values quoted are for MG, but similar numbers apply to SC and EC.)

  1. PDB. These are residues in the known structures. In MG, with PSI-BLAST the fraction of amino acids masked has crept up to over 30%. These residues fall in 296 distinct matches (segments) in 208 of the 468 MG sequences (44%). On average, each match is about 175 amino acids long.
  2. TMB. The surest ("best") TM segments were masked next. These were segments of at least 20 residues with an average GES hydrophobicity better than -1 kcal/mole, in a protein that had at least one 20-residue segment with an average GES hydrophobicity better than -2 kcal/mole. (This is the Boyd and Beckwith MaxH approach to membrane protein identification.) In MG, only about 8% of the residues are flagged as sure TM segments, but note that these occur in 18% of the sequences. Almost all of these sequences are proteins with two or more TM helices.
  3. SIG. Signal sequences were masked. These have the pattern of a charged residue within the first seven residues, followed by a stretch of 14 residues with an average hydrophobicity under the cutoff. In MG, the total fraction of residues masked by PDB, TMB, and SIG is 38.5% (30.2% + 7.7% + 0.6%). These represent the surest part of the annotation.
  4. LCV. Low-complexity regions were annotated next. Stretches of low-complexity sequence are thought not to fold into globular protein structures. They may correspond to fibrous or disordered structures. Consequently, it is doubtful whether they will ever be crystallized. Low-complexity regions were identified with the SEG program using the standard parameters: a trigger complexity K(1) of 3.4, an extension complexity K(2) of 3.75, and a window of length 45. These are the parameters used to find "long", domain-size low-complexity regions. Quite a few of these were found. This means either that there are many low-complexity regions in the MG genome or that the SEG program is finding many false positives. We first mask very long low-complexity regions -- LCVs.
  5. LNK. Segments of sequence accounted for thus far -- i.e. PDB matches, low-complexity regions, or transmembrane helices -- are considered to be "characterized" regions. The average length of these regions is ~140 residues, and they make up ~58% of the total amino acids in a genome. Short sequences between characterized segments are considered to be linkers, loops, or coils connecting known structural elements, whether membrane-spanning helices or known globular domains. Over all the genomes, linker regions are about 16 residues long on average and constitute ~5% of the total amino acids.
  6. TMS. The remaining TM segments were annotated next. These have only a single 20-residue segment better than the cutoff.
  7. LCM. The remaining low-complexity segments were annotated next.
  8. LN2. Linkers were found again.
  9. UCD. After the whole masking process is done, including finding the linkers, one is left with regions of sequence that have not been characterized at all in a structural sense. These "uncharacterized regions" presumably fold into soluble, globular protein structures. However, some of them could also be part of all-beta membrane proteins, such as porins. The uncharacterized regions provide the most suitable comparison with the PDB, which also consists (mostly) of soluble proteins with globular structures. They currently constitute about 36.5% of the amino acids in the MG genome.
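
The ordering of the steps above matters, since a residue masked by an earlier step is not relabeled by a later one. The sketch below illustrates that precedence together with the linker and UCD steps. It is only an illustration under stated assumptions: the function names and the 0-based, end-exclusive coordinates are hypothetical, and the `max_linker` threshold is a placeholder, since the text does not say how short a gap must be to count as a linker. We also assume that a linker must be flanked by characterized segments on both sides, so unmasked runs at the termini stay uncharacterized.

```python
def mask_in_order(length, layers):
    """Apply labelled segments in priority order; earlier layers win.

    layers is a list of (label, [(start, end), ...]) pairs, with
    0-based, end-exclusive coordinates (an assumed convention).
    """
    labels = [None] * length
    for label, segments in layers:
        for start, end in segments:
            for i in range(max(start, 0), min(end, length)):
                if labels[i] is None:  # never overwrite an earlier mask
                    labels[i] = label
    return labels


def label_linkers(labels, max_linker=30, linker_label="LNK"):
    """Label short unmasked runs flanked by masked residues as linkers.

    max_linker is a hypothetical threshold, not a value from the text.
    """
    out = list(labels)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        # Find the end of this unmasked run.
        j = i
        while j < n and out[j] is None:
            j += 1
        flanked = i > 0 and j < n
        if flanked and (j - i) <= max_linker:
            out[i:j] = [linker_label] * (j - i)
        i = j
    return out


def finish(labels, rest_label="UCD"):
    """Mark everything still unmasked as uncharacterized."""
    return [x if x is not None else rest_label for x in labels]


# Toy example: a 100-residue protein with one PDB match and one TM segment.
labels = mask_in_order(100, [("PDB", [(0, 40)]), ("TMB", [(50, 70)])])
labels = label_linkers(labels, max_linker=30)
final = finish(labels)
```

In the toy example, the 10-residue gap between the PDB match and the TM segment becomes a linker, while the unmasked C-terminal 30 residues are left as UCD. Running `label_linkers` a second time after the TMS and LCM layers are added would correspond to the LN2 step.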