For Genome MG, Tables with Specific Analysis

Table Name Size (kb), Format Links Fields (keys bold) Description
tm segs 12 k, tab delim. data, head id_, start_I, stop_n, sumscor, energy_f

Transmembrane segments. 
(version 2, revised 971113)
(version 3, revised 981127, now sumscor based on calc_istm_score)
sumscor gives a confidence value for each TM helix, based on an analysis
of all the TM helices in the WHOLE protein.
   #
   # These parameters were refined on MG
   # see genomes/mg-analyze-maxh-981127.xls
   # 
   my  = (minhall<-2 ? 4 :
		 (tot_aa > 50 ? 3 : 
		  ( minhall <-1.75 ? 2 :
		    ( tot_aa > 20 ? 1 : 0))));
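
The nested conditional above can be read as a five-level score (0 to 4)
in which the first matching rule wins. Below is a minimal Python sketch
of that reading. It assumes that minhall and tot_aa correspond to the
minhall and totaa fields of the tm_scores table and that the result is
the value stored as sumscor; the name of the assigned variable is
missing in the original, and the function name tm_confidence is
hypothetical.

```python
def tm_confidence(minhall: float, tot_aa: int) -> int:
    """Mirror the nested conditional: rules are tried in order, first match wins.

    minhall and tot_aa are assumed to match the tm_scores fields
    'minhall' and 'totaa'; the function name is illustrative only.
    """
    if minhall < -2:
        return 4
    if tot_aa > 50:
        return 3
    if minhall < -1.75:
        return 2
    if tot_aa > 20:
        return 1
    return 0


if __name__ == "__main__":
    for args in [(-2.5, 10), (-1.0, 60), (-1.8, 30), (-1.0, 30), (-1.0, 10)]:
        print(args, tm_confidence(*args))
```

Note that the order of the tests matters: a protein with tot_aa > 50
scores 3 even when minhall is below -1.75, because the tot_aa test is
reached first.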


signal segs 2 k, tab delim. data, head id_, start_I, stop_n

  Signal sequences.


sat mg strucs 12 k, tab delim. data, head

Reformatted version of
SAT's http://www.mrc-lmb.cam.ac.uk/genomes/MG_strucs.html
by MBG. 
Here is the readme for the original.  "_" or "?" was used for
unidentifiable fields.
Explanation of format:
     (MG sequence number)-(sequence length) (MG sequence region) (scop sequence name)-(sequence length)
     (sequence region) (expectation value with which found)
     If there is an M at the end of the line, the relationship was only found starting the search from the MG sequence.
     If the information 'via sequence name' is given, the relationship is not found with PSI-BLAST, but only via another
     sequence in the GEANFAMMER sequence family of that sequence.
     If there is a * at the end of the line, the expectation value is below that considered significant, but the match is
     accepted for other reasons.
MG001-267 149-266 pdb_d2pola3-122 4-119 8e-25
MG002-310 3-64 pdb_d1xbl__-75 4-68 5e-16
MG003-650 412-640 pdb_d1bgw__-680 1-243 4e-79
MG003-650 96-215 pdb_d1ah6__-213 89-212 8e-20
MG004-836 24-498 pdb_d1bgw__-680 206-677 1e-128
MG005-417 5-110 pdb_d1seta1-110 4-110 4e-25
MG005-417 114-413 pdb_d1seta2-311 3-303 1e-109
MG006-210 1-197 pdb_3adk__-194 5-186 8e-34 M
.
.
.
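
The line format explained above can be split into its parts
mechanically. The Python sketch below parses one such line; it is an
illustration, not the script used to build the table. The record type
StrucMatch, its field names and the function names are hypothetical.
Lines carrying the 'via sequence name' information are not handled,
since their exact layout is not shown here; anything after the
expectation value is simply kept as flags.

```python
from typing import NamedTuple, Tuple


class StrucMatch(NamedTuple):
    # All field names are illustrative; they are not taken from the original tables.
    mg_id: str
    mg_len: int
    mg_start: int
    mg_stop: int
    scop_name: str
    scop_len: int
    scop_start: int
    scop_stop: int
    evalue: float
    flags: Tuple[str, ...]


def _name_and_length(token: str) -> Tuple[str, int]:
    # Split on the LAST hyphen; scop names contain underscores but the length follows a hyphen.
    name, length = token.rsplit("-", 1)
    return name, int(length)


def _region(token: str) -> Tuple[int, int]:
    start, stop = token.split("-")
    return int(start), int(stop)


def parse_struc_line(line: str) -> StrucMatch:
    fields = line.split()
    mg_id, mg_len = _name_and_length(fields[0])
    mg_start, mg_stop = _region(fields[1])
    scop_name, scop_len = _name_and_length(fields[2])
    scop_start, scop_stop = _region(fields[3])
    evalue = float(fields[4])
    # Trailing markers such as "M" or "*" are kept verbatim.
    return StrucMatch(mg_id, mg_len, mg_start, mg_stop,
                      scop_name, scop_len, scop_start, scop_stop,
                      evalue, tuple(fields[5:]))


if __name__ == "__main__":
    print(parse_struc_line("MG006-210 1-197 pdb_3adk__-194 5-186 8e-34 M"))
```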


minscop soluble matches no overlap 9 k, tab delim. data, head gid_, TargStart_I, TargStop_n, did, fids, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f

These are the good matches at an e-value cutoff of 0.01,
for just the soluble proteins (scop classes 1-5 and 7).
This table is the result of removing from minscop_soluble_matches
the matches that hit the same sequence on the genome.
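
How overlapping matches are resolved is not stated here. As one
possible reading, the Python sketch below keeps, for each genome
sequence, the match with the lowest e-value and drops any later match
whose target region overlaps one already kept. The rule "lowest
e-value wins", the inclusive coordinates and the function name
drop_overlaps are all assumptions made for illustration.

```python
def drop_overlaps(matches):
    """Keep non-overlapping matches per genome id, preferring low e-values.

    Each match is a tuple (gid, targ_start, targ_stop, evalue) with
    inclusive coordinates. This is an assumed rule, not the documented one.
    """
    kept = []
    for match in sorted(matches, key=lambda m: m[3]):
        gid, start, stop, _evalue = match
        clash = any(k[0] == gid and start <= k[2] and k[1] <= stop for k in kept)
        if not clash:
            kept.append(match)
    # Return in genome order for readability.
    return sorted(kept, key=lambda m: (m[0], m[1]))


if __name__ == "__main__":
    hits = [
        ("MG003", 412, 640, 4e-79),
        ("MG003", 400, 500, 1e-5),   # overlaps the first hit, weaker e-value
        ("MG003", 96, 215, 8e-20),
        ("MG004", 24, 498, 1e-128),
    ]
    for hit in drop_overlaps(hits):
        print(hit)
```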


id ntm 5 k, tab delim. data, head id_, signalp, ntm_n

This table contains the number of transmembrane segments for each ORF.
TM segments are counted after filtering.
It also has signal sequence data, based on simple criteria. 


genome v minscop 14 k, tab delim. data, head did_, gid_, TargStart_I, TargStop_n, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f, comment

This is a custom-made file based on the values in the table
sat_mg_strucs_nov98.txt. It was constructed by MBG on 981127. It has
all the original fields, plus a comment.  The original matches are
from the web page at the MRC LMB.  The beginning of this page is
reproduced below.
--
SCOP Domain Sequences in the MG Genome
(Additional information for "Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain
rearrangements" (1998) by Sarah A. Teichmann, Jong Park and Cyrus Chothia, Proc. Natl. Acad. Sci. USA, 95, 14658-14663)
Explanation of format:
     (MG sequence number)-(sequence length) (MG sequence region) (scop sequence name)-(sequence length)
     (sequence region) (expectation value with which found) 
     If there is an M at the end of the line, the relationship was only found starting the search from the MG sequence. 
     If the information 'via sequence name' is given, the relationship is not found with PSI-BLAST, but only via another
     sequence in the GEANFAMMER sequence family of that sequence. 
     If there is a * at the end of the line, the expectation value is below that considered significant, but the match is
     accepted for other reasons. 
MG001-267 149-266 pdb_d2pola3-122 4-119 8e-25
MG002-310 3-64 pdb_d1xbl__-75 4-68 5e-16
MG002-310 122-211 pdb_d1tbd__-134 7-91 5e-7
MG003-650 412-640 pdb_d1bgw__-680 1-243 4e-79
MG003-650 229-410 pdb_ds043_1-172 1-172 4e-54
MG003-650 96-215 pdb_d1ah6__-213 89-212 8e-20
MG004-836 24-498 pdb_d1bgw__-680 206-677 1e-128
MG005-417 5-110 pdb_d1seta1-110 4-110 4e-25
MG005-417 114-413 pdb_d1seta2-311 3-303 1e-109


fold occurrence 4 k, tab delim. data, head fold_, count

Number of times each fold (represented by two scop fid numbers) occurs in genome MG
This table should be sorted into a standard order.
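
A count of this kind could be produced as in the Python sketch below.
It assumes that the fids field of a match is a dot-separated string of
numbers whose first two numbers identify the fold, and that "standard
order" means numeric order of those two numbers. Neither assumption is
confirmed by the description above, and the function name
fold_occurrence is hypothetical.

```python
from collections import Counter


def fold_occurrence(fids_list):
    """Count folds, taking the first two dot-separated numbers of each fids string.

    The dotted fids layout is an assumption made for this sketch.
    """
    counts = Counter(".".join(fids.split(".")[:2]) for fids in fids_list)

    def numeric_order(item):
        fold, _count = item
        return tuple(int(part) for part in fold.split("."))

    # Sort numerically so that fold 1.4 comes before fold 1.10.
    return sorted(counts.items(), key=numeric_order)


if __name__ == "__main__":
    for fold, count in fold_occurrence(["1.4.1.1", "3.24.1.2", "1.4.2.1", "1.10.1.1"]):
        print(fold, count, sep="\t")
```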


all masks 89 k, tab delim. data, head gid_, start_I, stop_n, tool_, score

This file concatenates the results of 
creating all the masks for genome MG. 


aafreq histo 1 k, tab delim. data, head aa_, freq_n

Histogram of frequency of the various amino acids
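
A histogram like this can be computed as in the Python sketch below. It
reports each of the 20 standard amino acids as a fraction of all
standard residues; whether the freq_n field holds fractions or raw
counts is not stated, so the choice of fractions is an assumption, as
is the function name aa_frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequences):
    """Return {amino acid: fraction} over the 20 standard residues.

    Characters outside the standard alphabet (for example X) are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    freqs = aa_frequencies(["MKTLLVAGGS", "MSEQ"])
    for aa in AMINO_ACIDS:
        print(aa, round(freqs[aa], 3), sep="\t")
```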


alla segs 2 k, tab delim. data, head id_, start_I, stop_n

All-alpha (all-a) segments.


allb segs 2 k, tab delim. data, head id_, start_I, stop_n

All-beta (all-b) segments.


characterized domains 1 8 k, tab delim. data, head id_, start_I, stop_n

Already characterized domains (the borders between
linker regions).
This is done at phase _1.
generate_linkers() running on MG/seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv with tag _1 


characterized domains 2 7 k, tab delim. data, head id_, start_I, stop_n

Already characterized domains (the borders between
linker regions).
This is done at phase _2.
generate_linkers() running on MG/seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm with tag _2 


comp report 13 k, tab delim. data, head selection, genome, sum, total_seqs, masked_seqs, total_chars, masked_chars, total_segs, masking_segs, mask_chars_per_seg, mask_chars_per_seq, frac_masked_chars, frac_masked_seqs, masking_segs_per_seq, dav_rms, dps_rms, dav_A, dav_C, dav_D, dav_E, dav_F, dav_G, dav_H, dav_I, dav_K, dav_L, dav_M, dav_N, dav_P, dav_Q, dav_R, dav_S, dav_T, dav_V, dav_W, dav_Y, dps_A, dps_C, dps_D, dps_E, dps_F, dps_G, dps_H, dps_I, dps_K, dps_L, dps_M, dps_N, dps_P, dps_Q, dps_R, dps_S, dps_T, dps_V, dps_W, dps_Y, pct_A, pct_C, pct_D, pct_E, pct_F, pct_G, pct_H, pct_I, pct_K, pct_L, pct_M, pct_N, pct_P, pct_Q, pct_R, pct_S, pct_T, pct_V, pct_W, pct_Y, A_n, C_n, D_n, E_n, F_n, G_n, H_n, I_n, K_n, L_n, M_n, N_n, P_n, Q_n, R_n, S_n, T_n, V_n, W_n, Y_n, 0_n, 1_n, 2_n, 3_n, 4_n, 5_n, 6_n, 7_n, 8_n, 9_n, selections_long, sort

Report on the compositions in the genomes
MG
based on the following a.a. selections 
seq pdb tmb sig lcv lnk tms lcm ln2 fun nof ucd uc2 fasta__ sat9811 psib1wy 
(version 3)


full len segs 6 k, tab delim. data, head id_, start_I, stop_n

Full length segments.


genome v minscop fasta 7 k, tab delim. data, head gid_, TargStart_I, TargStop_n, did, fids, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f

These are the good matches at an e-value cutoff of 0.01,
for just the soluble proteins (scop classes 1-5 and 7).
They were generated by FASTA against scop 1.35.
They are no longer used in the analysis but are kept here for
comparative purposes.


genome v minscop psib1way 10 k, tab delim. data, head



genome v minscop sat9811 14 k, tab delim. data, head did_, gid_, TargStart_I, TargStop_n, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f, comment

This is a custom-made file based on the values in the table
sat_mg_strucs_nov98.txt. It was constructed by MBG on 981127. It has
all the original fields, plus a comment.  The original matches are
from the web page at the MRC LMB.  The beginning of this page is
reproduced below.
1998.12.14:
d1dts__	MG080	707	847	5	173	2.00E-07	_	_	(mbg fixed)
--
SCOP Domain Sequences in the MG Genome
(Additional information for "Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain
rearrangements" (1998) by Sarah A. Teichmann, Jong Park and Cyrus Chothia, Proc. Natl. Acad. Sci. USA, 95, 14658-14663)
Explanation of format:
     (MG sequence number)-(sequence length) (MG sequence region) (scop sequence name)-(sequence length)
     (sequence region) (expectation value with which found) 
     If there is an M at the end of the line, the relationship was only found starting the search from the MG sequence. 
     If the information 'via sequence name' is given, the relationship is not found with PSI-BLAST, but only via another
     sequence in the GEANFAMMER sequence family of that sequence. 
     If there is a * at the end of the line, the expectation value is below that considered significant, but the match is
     accepted for other reasons. 
MG001-267 149-266 pdb_d2pola3-122 4-119 8e-25
MG002-310 3-64 pdb_d1xbl__-75 4-68 5e-16
MG002-310 122-211 pdb_d1tbd__-134 7-91 5e-7
MG003-650 412-640 pdb_d1bgw__-680 1-243 4e-79
MG003-650 229-410 pdb_ds043_1-172 1-172 4e-54
MG003-650 96-215 pdb_d1ah6__-213 89-212 8e-20
MG004-836 24-498 pdb_d1bgw__-680 206-677 1e-128
MG005-417 5-110 pdb_d1seta1-110 4-110 4e-25
MG005-417 114-413 pdb_d1seta2-311 3-303 1e-109


genome v minscop sep98 12 k, tab delim. data, head did_, gid_, TargStart_I, TargStop_n, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f, comment

This is a custom-made file based on the values
in the table sat_mg_strucs.txt.
This was done by MBG on 980907. 
It has all the original fields, plus a comment.  


gorss 173 k, fasta data, head gid_, gorss

This fasta file is the result of running GOR secondary structure prediction
on the genome MG.


gorss MBY nul 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking gorss 
with the mask full_len_segs


gorss MBY nul COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file gorss with
the mask full_len_segs to generate the masked fasta file gorss_MBY_nul.


gorss MBY nul STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file gorss with
the mask full_len_segs to generate the masked fasta file gorss_MBY_nul.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask
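
The three statistics defined above can be illustrated with a small
masking routine. The Python sketch below is not the original
implementation. It assumes that a mask is a list of (gid, start, stop)
segments with 1-based inclusive coordinates, that masked residues are
replaced by the character X, and that a segment counts towards
Masking_Segs only if it masks at least one new character. The function
name apply_mask is hypothetical.

```python
def apply_mask(seqs, segments, mask_char="X"):
    """Mask segments in a {gid: sequence} dict and collect the three statistics.

    Coordinates are assumed 1-based and inclusive. Residues that already
    equal mask_char are left alone and not counted again.
    """
    masked = {gid: list(seq) for gid, seq in seqs.items()}
    stats = {"MASKED_CHARS": 0, "Masked_Seqs": 0, "Masking_Segs": 0}
    touched = set()
    for gid, start, stop in segments:
        if gid not in masked:
            continue  # segment refers to a sequence that is not in the file
        chars = masked[gid]
        used = False
        for i in range(max(start, 1) - 1, min(stop, len(chars))):
            if chars[i] != mask_char:
                chars[i] = mask_char
                stats["MASKED_CHARS"] += 1
                used = True
        if used:
            stats["Masking_Segs"] += 1
            touched.add(gid)
    stats["Masked_Seqs"] = len(touched)
    return {gid: "".join(chars) for gid, chars in masked.items()}, stats


if __name__ == "__main__":
    seqs = {"MG001": "MKTLLVAGGS", "MG002": "MSEQ"}
    mask = [("MG001", 3, 5), ("MG001", 5, 6), ("MG999", 1, 2)]
    masked_seqs, stats = apply_mask(seqs, mask)
    print(masked_seqs)
    print(stats)
```

Applying masks one after another in this way, each to the output of the
previous step, would give the chained file names (seq_MBY_pdb,
seq_MBY_pdb_MBY_tmb, and so on).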


gorss MBY ucd 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking gorss 
with the mask unchar_domains


gorss MBY ucd COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file gorss with
the mask unchar_domains to generate the masked fasta file gorss_MBY_ucd.


gorss MBY ucd STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file gorss with
the mask unchar_domains to generate the masked fasta file gorss_MBY_ucd.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


hlx aa pair freq 20 k, tab delim. data, head aa_, offset_, count

Counts of all amino acid pairs, by offset.


id ntm nofilt 5 k, tab delim. data, head id_, signalp, ntm_n

  This table contains data on whether there is a signal sequence
  and the number of transmembrane segments.
  (version 2, revised 971113).
  (Renamed table on 980101: id_ntm --> id_ntm_nofilt)


lcm segs 6 k, tab delim. data, head gid_, start_, stop, score

This routine splits the low_complexity_long (LCL) regions into the LCV
(low-complexity very-long) and LCM (low-complexity medium). The
criterion is simply length. LCVs are LCLs longer than 150 aa. LCMs are
shorter.
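
The length split described above can be written in a few lines. The
Python sketch below assumes inclusive start and stop coordinates (so
length = stop - start + 1) and takes "longer than 150 aa" strictly, so
a region of exactly 150 aa goes to LCM. The function name
split_low_complexity is hypothetical.

```python
def split_low_complexity(lcl_segments, cutoff=150):
    """Split LCL segments into (lcv, lcm) lists by length alone.

    Each segment is a tuple starting with (gid, start, stop); any extra
    fields, such as a score, are carried along unchanged.
    """
    lcv, lcm = [], []
    for seg in lcl_segments:
        _gid, start, stop = seg[:3]
        length = stop - start + 1  # inclusive coordinates are assumed
        (lcv if length > cutoff else lcm).append(seg)
    return lcv, lcm


if __name__ == "__main__":
    lcl = [("MG010", 1, 151, 3.1), ("MG010", 200, 349, 3.2), ("MG020", 5, 40, 2.9)]
    lcv, lcm = split_low_complexity(lcl)
    print("LCV:", lcv)
    print("LCM:", lcm)
```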


lcm segs.txt 5 k, tab delim. data, head gid_, start_, stop, score

This routine splits the low_complexity_long (lcl) regions into the lcv
(low-complexity very long) and lcm (low-complexity medium). The
criterion is simply length. LCVs are LCLs longer than 150 aa. LCMs are
shorter.


lcv segs 2 k, tab delim. data, head gid_, start_, stop, score

This routine splits the low_complexity_long (LCL) regions into the LCV
(low-complexity very-long) and LCM (low-complexity medium). The
criterion is simply length. LCVs are LCLs longer than 150 aa. LCMs are
shorter.


lcv segs.txt 1 k, tab delim. data, head gid_, start_, stop, score

This routine splits the low_complexity_long (lcl) regions into the lcv
(low-complexity very long) and lcm (low-complexity medium). The
criterion is simply length. LCVs are LCLs longer than 150 aa. LCMs are
shorter.


linkers 1 8 k, tab delim. data, head id_, start_I, stop_n

Linker regions between two other defined segments,
which are less than 50 aa in length.
This is done at phase _1.
generate_linkers() running on MG/seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv with tag _1 
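
The idea of a linker as a short gap between two defined segments can be
sketched as follows. This is not the generate_linkers() routine itself,
whose code is not shown here. The sketch assumes 1-based inclusive
coordinates, treats only gaps between consecutive segments of the same
sequence (not the ends of the sequence) as candidates, and uses the
hypothetical name find_linkers.

```python
from collections import defaultdict


def find_linkers(segments, max_len=50):
    """Return gaps shorter than max_len that lie between defined segments.

    segments is a list of (gid, start, stop) tuples; the result uses the
    same layout. Overlapping or touching segments produce no linker.
    """
    by_seq = defaultdict(list)
    for gid, start, stop in segments:
        by_seq[gid].append((start, stop))

    linkers = []
    for gid in sorted(by_seq):
        regions = sorted(by_seq[gid])
        covered_to = regions[0][1]
        for start, stop in regions[1:]:
            gap_start, gap_stop = covered_to + 1, start - 1
            gap_len = gap_stop - gap_start + 1
            if 0 < gap_len < max_len:
                linkers.append((gid, gap_start, gap_stop))
            covered_to = max(covered_to, stop)
    return linkers


if __name__ == "__main__":
    defined = [
        ("MG003", 96, 215), ("MG003", 229, 410), ("MG003", 412, 640),
        ("MG004", 24, 498), ("MG004", 600, 700),
    ]
    for linker in find_linkers(defined):
        print(*linker, sep="\t")
```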


linkers 2 2 k, tab delim. data, head id_, start_I, stop_n

Linker regions between two other defined segments,
which are less than 50 aa in length.
This is done at phase _2.
generate_linkers() running on MG/seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm with tag _2 


low complexity long 7 k, tab delim. data, head id_, start_I, stop_n, cplxity_f

Low complexity regions generated with the
following seg command: seg/seg tmp.fa 45 3.4 3.75 -l


low complexity short 13 k, tab delim. data, head id_, start_I, stop_n, cplxity_f

Low complexity regions generated with the
following seg command: seg/seg tmp.fa 25 3.0 3.3 -l


minscop occurrence 10 k, tab delim. data, head did_, count

Number of times each minscop domain id (did) occurs in genome MG
This table should be sorted into a standard order and contain 990 entries. 


minscop soluble matches 18 k, tab delim. data, head gid_, TargStart_I, TargStop_n, did, fids, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f

These are the good matches at an e-value cutoff of 0.01,
for just the soluble proteins (scop classes 1-5 and 7).
This is with a year cutoff of 97.
good_scop_matches_to_mask_w_yr() running on genome MG...
...with year cutoff of 97 and table genome_v_minscop 


minscop soluble matches overlap 1 k, tab delim. data, head gid_, TargStart_I, TargStop_n, did, fids, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f

These are the good matches at an e-value cutoff of 0.01,
for just the soluble proteins (scop classes 1-5 and 7).
This table holds the matches from
minscop_soluble_matches that hit the same sequence on the genome.
That is, it contains duplicate matches that should not be used.


null mask 1 k, tab delim. data, head



pdb40d135 soluble matches 31 k, tab delim. data, head gid_, TargStart_I, TargStop_n, did, fids, QryStart_n, QryStop_n, ev_f, swsc_n, swid_f

These are the good matches at an e-value cutoff of 0.01,
for just the soluble proteins (scop classes 1-5 and 7).


seq 193 k, fasta data, head



seq MBY cdo 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask characterized_domains


seq MBY cdo COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq with
the mask characterized_domains to generate the masked fasta file seq_MBY_cdo.


seq MBY cdo STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask characterized_domains to generate the masked fasta file seq_MBY_cdo.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY lcs 4 k, Bad! data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask low_complexity_short


seq MBY lcs COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq with
the mask low_complexity_short to generate the masked fasta file seq_MBY_lcs.


seq MBY lcs STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask low_complexity_short to generate the masked fasta file seq_MBY_lcs.


seq MBY lnk 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask linkers


seq MBY lnk COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq with
the mask linkers to generate the masked fasta file seq_MBY_lnk.


seq MBY lnk STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask linkers to generate the masked fasta file seq_MBY_lnk.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY nul 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask full_len_segs


seq MBY nul COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq with
the mask full_len_segs to generate the masked fasta file seq_MBY_nul.


seq MBY nul STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask full_len_segs to generate the masked fasta file seq_MBY_nul.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask minscop_soluble_matches


seq MBY pdb COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq with
the mask minscop_soluble_matches to generate the masked fasta file seq_MBY_pdb.


seq MBY pdb MBY tmb 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb 
with the mask tm_segs_best


seq MBY pdb MBY tmb COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb with
the mask tm_segs_best to generate the masked fasta file seq_MBY_pdb_MBY_tmb.


seq MBY pdb MBY tmb MBY sig 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb 
with the mask signal_segs


seq MBY pdb MBY tmb MBY sig COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb with
the mask signal_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig.


seq MBY pdb MBY tmb MBY sig MBY lcl 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig 
with the mask low_complexity_long


seq MBY pdb MBY tmb MBY sig MBY lcl COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig with
the mask low_complexity_long to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl.


seq MBY pdb MBY tmb MBY sig MBY lcl MBY tms 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl 
with the mask tm_segs


seq MBY pdb MBY tmb MBY sig MBY lcl MBY tms COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl with
the mask tm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms.


seq MBY pdb MBY tmb MBY sig MBY lcl MBY tms MBY lnk 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms 
with the mask linkers


seq MBY pdb MBY tmb MBY sig MBY lcl MBY tms MBY lnk COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms with
the mask linkers to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms_MBY_lnk.


seq MBY pdb MBY tmb MBY sig MBY lcl MBY tms MBY lnk STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms with
the mask linkers to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms_MBY_lnk.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcl MBY tms STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl with
the mask tm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcl STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig with
the mask low_complexity_long to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcl.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig 
with the mask lcv_segs


seq MBY pdb MBY tmb MBY sig MBY lcv COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig with
the mask lcv_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv 
with the mask linkers_1


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv with
the mask linkers_1 to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk 
with the mask tm_segs


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk with
the mask tm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms 
with the mask lcm_segs


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms with
the mask lcm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm 
with the mask linkers_2


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm with
the mask linkers_2 to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 MBY fun 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2 
with the mask unchar_domains_2_have_func


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 MBY fun COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2 with
the mask unchar_domains_2_have_func to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 MBY fun MBY nof 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun 
with the mask unchar_domains_2_no_func


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 MBY fun MBY nof COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun with
the mask unchar_domains_2_no_func to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun_MBY_nof.


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 MBY fun MBY nof STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun with
the mask unchar_domains_2_no_func to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun_MBY_nof.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 MBY fun STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2 with
the mask unchar_domains_2_have_func to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2_MBY_fun.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm MBY ln2 STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm with
the mask linkers_2 to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm_MBY_ln2.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms MBY lcm STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms with
the mask lcm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk MBY tms STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk with
the mask tm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv MBY lnk STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv with
the mask linkers_1 to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig MBY lcv STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb_MBY_sig with
the mask lcv_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb MBY sig STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb_MBY_tmb with
the mask signal_segs to generate the masked fasta file seq_MBY_pdb_MBY_tmb_MBY_sig.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tmb STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb with
the mask tm_segs_best to generate the masked fasta file seq_MBY_pdb_MBY_tmb.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb MBY tms 1 k, Bad! data, head gid_, masked_seq

This fasta file is the result of masking seq_MBY_pdb 
with the mask tm_segs


seq MBY pdb MBY tms COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the 
masked file from masking the fasta file seq_MBY_pdb with
the mask tm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tms.


seq MBY pdb MBY tms STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq_MBY_pdb with
the mask tm_segs to generate the masked fasta file seq_MBY_pdb_MBY_tms.


seq MBY pdb STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask minscop_soluble_matches to generate the masked fasta file seq_MBY_pdb.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY pdb fasta COMP 1 k, tab delim. data, head



seq MBY pdb fasta STAT 1 k, tab delim. data, head



seq MBY pdb psib1way COMP 1 k, tab delim. data, head



seq MBY pdb psib1way STAT 1 k, tab delim. data, head



seq MBY pdb sat9811 COMP 1 k, tab delim. data, head



seq MBY pdb sat9811 STAT 1 k, tab delim. data, head



seq MBY uc2 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask unchar_domains_2


seq MBY uc2 COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the masked fasta file seq_MBY_uc2,
generated by masking the fasta file seq with the mask unchar_domains_2.


seq MBY uc2 STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask unchar_domains_2 to generate the masked fasta file seq_MBY_uc2.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq MBY ucd 173 k, fasta data, head gid_, masked_seq

This fasta file is the result of masking seq 
with the mask unchar_domains_1


seq MBY ucd COMP 1 k, tab delim. data, head aa_, count_n

This is the aa composition of the masked fasta file seq_MBY_ucd,
generated by masking the fasta file seq with the mask unchar_domains_1.


seq MBY ucd STAT 1 k, tab delim. data, head stat_, value

These are the statistics from masking the fasta file seq with
the mask unchar_domains_1 to generate the masked fasta file seq_MBY_ucd.
MASKED_CHARS  = number of characters masked with the application of this mask.
Masked_Seqs   = number of sequences masked with the application of this mask.
Masking_Segs  = number of segments used in the application of the mask


seq lengths 5 k, tab delim. data, head gid_, length_n

Length of each sequence in genome.
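
A table like this (gid_, length_n) can be derived directly from a fasta file. The Python sketch below is a hedged illustration, not the original tool. It assumes the sequence id is the first whitespace-delimited token on each ">" header line, and the function names are made up for the example.

```python
def read_fasta(lines):
    """Parse fasta-formatted lines into a dict of id -> sequence string."""
    parts = {}
    gid = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            gid = header[0] if header else ""
            parts[gid] = []
        elif gid is not None:
            parts[gid].append(line)
    return {g: "".join(chunks) for g, chunks in parts.items()}


def seq_lengths(seqs):
    """Return (id, length) pairs in the order the sequences were read."""
    return [(gid, len(seq)) for gid, seq in seqs.items()]
```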


tigr annote 9812 9710 diffsonly 5 k, tab delim. data, head



tigr annote 9812 9710 merge 51 k, tab delim. data, head



tigr seq dec98 204 k, fasta data, head



tigr seq dec98 lengths 5 k, tab delim. data, head gid_, length_n

Length of each sequence in genome.
  my $txdb = txdb->new (name=>tigr_seq_dec98,io=>INPUT_SLURP_FASTA,ext=>fa);


tm scores 14 k, tab delim. data, head id_, sumscr, sig, minhall, ntmproc, totaa, avg_en, minhseg

  This table contains scores indicating whether, and to what
  degree, each sequence is an integral membrane protein.
   sig = does it have a signal sequence?
   sumscor = overall evaluation score (column sumscr; see below)
   minhall = min hydrophobicity value for a 20-residue window moved over the whole protein
   totaa  = total number of aa under the -1 threshold
   ntmproc = total number of TM helices after processing
   avg_en = average hydrophobicity of all the TM segments (per residue)
   minhseg = min hydrophobicity of a TM segment (per residue)
   #
   # These parameters were refined on MG
   # see genomes/mg-analyze-maxh-981127.xls
   # 
   my $sumscor = ($minhall < -2 ? 4 :
		 ($tot_aa > 50 ? 3 :
		  ($minhall < -1.75 ? 2 :
		    ($tot_aa > 20 ? 1 : 0))));
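
The nested conditional above reads as a cascade of thresholds tested in order. The following Python sketch (the original fragment is Perl) restates it under two assumptions: that the value being assigned is the sumscor score described above, and that tot_aa corresponds to the totaa column.

```python
def sumscor(minhall, tot_aa):
    """Return the 0-4 score from the threshold cascade.

    minhall: minimum hydrophobicity over a 20-residue window
    tot_aa:  number of residues under the -1 threshold
    """
    if minhall < -2:
        return 4
    if tot_aa > 50:
        return 3
    if minhall < -1.75:
        return 2
    if tot_aa > 20:
        return 1
    return 0
```

Because the comparisons are strict, a protein with minhall exactly -2 does not score 4. It falls through to the later tests.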


tm segs best 10 k, tab delim. data, head id_, start_I, stop_n, sumscor, energy_f

These are the segments from table tm_segs
that have sumscor = 4.
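
The selection can be sketched as a filter over the tab-delimited tm_segs data. This Python example is an illustration under stated assumptions: the data file is assumed to carry no header row (the field names are taken from the listing above), and the function name best_segments is hypothetical.

```python
import csv
import io

# Field names as listed for tm_segs; assumed to match the column order.
FIELDS = ["id_", "start_I", "stop_n", "sumscor", "energy_f"]


def best_segments(tab_text, score=4):
    """Return the rows of tab-delimited text whose sumscor equals score."""
    best = []
    for row in csv.reader(io.StringIO(tab_text), delimiter="\t"):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        rec = dict(zip(FIELDS, row))
        if float(rec["sumscor"]) == score:
            best.append(rec)
    return best
```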


tm segs filtered 5 k, tab delim. data, head id_, start_I, stop_n, energy_f

Transmembrane segment definitions after removing pdb matches and (most
importantly) low-complexity regions. The tm_segs table is just
the raw data.
This is based on examining the masked file seq_MBY_pdb_MBY_lcl_MBY_tms_MBY_lnk for the TM
segments (annotated with a 3).


unchar domains 1 5 k, tab delim. data, head id_, start_I, stop_n

Linker regions between two other defined segments
that are longer than 50 residues.
That is, these are uncharacterized protein domains.
This is done at phase _1.
generate_linkers() running on MG/seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv with tag _1 
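
The idea of a linker region can be sketched as finding the gaps between already-defined segments. The Python function below is not generate_linkers() itself. It is a hypothetical stand-in (find_linkers) that assumes 1-based inclusive coordinates, a strict "longer than 50" cutoff, and that the stretches before the first segment and after the last one also count as candidate regions.

```python
def find_linkers(seq_len, segs, min_len=50):
    """Return (start, stop) gaps between segments that exceed min_len.

    seq_len: length of the sequence
    segs:    list of 1-based inclusive (start, stop) segments, any order
    """
    linkers = []
    prev_stop = 0  # end of the region covered so far
    for start, stop in sorted(segs):
        if start - prev_stop - 1 > min_len:
            linkers.append((prev_stop + 1, start - 1))
        prev_stop = max(prev_stop, stop)
    # Tail region after the last segment (assumption: termini are included).
    if seq_len - prev_stop > min_len:
        linkers.append((prev_stop + 1, seq_len))
    return linkers
```

With no defined segments at all, a sufficiently long sequence comes back as a single region spanning its full length.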


unchar domains 2 5 k, tab delim. data, head id_, start_I, stop_n

Linker regions between two other defined segments
that are longer than 50 residues.
That is, these are uncharacterized protein domains.
This is done at phase _2.
generate_linkers() running on MG/seq_MBY_pdb_MBY_tmb_MBY_sig_MBY_lcv_MBY_lnk_MBY_tms_MBY_lcm with tag _2 


unchar domains 2 annotated 45 k, tab delim. data, head gid_, start_, stop, score, status, amtanot, len9812, len9710, lendif, homolog, tigr9812_annotation

This table is derived from the UCDs in unchar_domains_2.
It merges the TIGR annotation with the UCD regions.
gid_	=	TIGR Genome Identifier
status	=	0 for same in both ORF files, 1 or 2 for dotted in 9812 file and missing in 9710 file, -1 for missing in 9812 file but in 9710 file
amtanot	=	Level of annotation (0 for hypothetical protein, 1 for putative, and 2 if there seems to be clear assignment)
len9812	=	Length of ORF in 9812 file ( '_' if missing)
len9710	=	Length of ORF in 9710 file ( '_' if missing)
lendif	=	Absolute difference in lengths (9999 if an ORF is not present) 
homolog	=	homolog in 9812 ORF file annotations (MP = M. pneumoniae, MG = M. genitalium, EC = E. coli, &c)
tigr9812_annotation	=	Annotation from 9812 ORF file less homologs
score looks like this: A-BB-CC
A  = X if bad, F if functionally annotated, U if hypothetical or putative annotation
BB = FL if uncharacterized region spans the whole ORF
CC = MG if uncharacterized region has a paralog in MG
A sample score is:
U-FL-== (completely uncharacterized full length UCD without paralogs in MG)
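
The A-BB-CC score can be split into its three parts mechanically. The Python sketch below is illustrative. It assumes that "==" marks an absent flag in either the BB or the CC position, which the sample score shows only for CC, and the names UcdScore and parse_score are invented for the example.

```python
from typing import NamedTuple


class UcdScore(NamedTuple):
    annotation: str    # "X", "F" or "U"
    full_length: bool  # True if BB is "FL"
    mg_paralog: bool   # True if CC is "MG"


def parse_score(score):
    """Split an A-BB-CC score string into its three components."""
    a, bb, cc = score.split("-")
    if a not in ("X", "F", "U"):
        raise ValueError(f"unexpected annotation code: {a!r}")
    return UcdScore(a, bb == "FL", cc == "MG")
```

For the sample score above, parse_score("U-FL-==") gives annotation "U", full_length True and mg_paralog False.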


unchar domains 2 have func 5 k, tab delim. data, head gid_, start_, stop, score

This table is the part of unchar_domains_2_annotated that corresponds
to uncharacterized regions that have a well-characterized function. 
gid_	=	TIGR Genome Identifier
score looks like this: A-BB-CC
A  = X if bad, F if functionally annotated, U if hypothetical or putative annotation
BB = FL if uncharacterized region spans the whole ORF
CC = MG if uncharacterized region has a paralog in MG
A sample score is:
U-FL-== (completely uncharacterized full length UCD without paralogs in MG)


unchar domains 2 no func 4 k, tab delim. data, head gid_, start_, stop, score

This table is the part of unchar_domains_2_annotated that corresponds
to uncharacterized regions that DO NOT have a well-characterized function. 
gid_	=	TIGR Genome Identifier
score looks like this: A-BB-CC
A  = X if bad, F if functionally annotated, U if hypothetical or putative annotation
BB = FL if uncharacterized region spans the whole ORF
CC = MG if uncharacterized region has a paralog in MG
A sample score is:
U-FL-== (completely uncharacterized full length UCD without paralogs in MG)


when all struc report 9 k, tab delim. data, head year, stat, ucd-MG, pdb-MG, pdb_MBY_tmb-MG, pdb_MBY_tmb_MBY_sig-MG, pdb_MBY_tmb_MBY_sig_MBY_lcl-MG, pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms-MG, pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms_MBY_lnk-MG, value

Report on what fraction of the genome remains
uncharacterized structurally. Based 
on the following genomes
MG
and the following selections
ucd pdb pdb_MBY_tmb pdb_MBY_tmb_MBY_sig pdb_MBY_tmb_MBY_sig_MBY_lcl pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms pdb_MBY_tmb_MBY_sig_MBY_lcl_MBY_tms_MBY_lnk
(Modified on 981128 to only accommodate MG.)

