(a) Schematic showing the derivation of the YG data set and its breakdown into subsets. The steps in the derivation of YG are summarized in the Methods section. The size of YG is indicated for the last two steps in this procedure. The name YG1-x indicates YG after x steps. The final YG data set comprises 2,168 sequences. The subsets YGM, YGR, YGE and Y(GE)P that are mentioned in the text are shown in a Venn diagram.
(b) An example of a paralog family with associated pseudogenes. The positions of genes in the paralog family whose representative is the sequence C02F4.2 are indicated by grey ovals (totalling 40). The pseudogenes are marked with black ovals (totalling 4). A pseudogene fragment (YC02F4.2) from chromosome II is shown along with an example of a gene from this paralog family, W09C3.6 (a serine/threonine protein phosphatase PP1), with the homologous segment underlined. The pseudogene is interrupted by a frameshift relative to this gene (marked by a # symbol). The corresponding sequence in the gene paralog is boxed in black; it corresponds to one exon of the gene paralog. The stop codon of the gene is marked by an asterisk (*).
The estimated chromosomal distribution of pseudogenes. Each panel depicts the distribution of genes (left-hand side) and pseudogenes (right-hand side) for chromosomes I, II, III, IV, V and X. The EST-matched subsets for each chromosome are binned as a dark grey bar, with the remainder of the genes or pseudogenes as a light grey bar. The bin size is 250,000 bases. The axis for the number of pseudogenes is scaled by a factor of two (×2) relative to the corresponding axis for genes.
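The binning described in this legend can be sketched as follows. This is a minimal illustration rather than the procedure actually used to draw the figure: only the bin size of 250,000 bases comes from the legend, while the function name `bin_counts`, the convention that bin k covers bases from k × 250,000 up to (but excluding) (k + 1) × 250,000, and the example coordinates are assumptions.

```python
from collections import Counter

BIN_SIZE = 250_000  # bin width in bases, as stated in the legend


def bin_counts(start_positions, bin_size=BIN_SIZE):
    """Count features per chromosomal bin.

    Assumption: bin k covers [k * bin_size, (k + 1) * bin_size) and each
    feature is assigned to the bin containing its start coordinate.
    """
    return Counter(pos // bin_size for pos in start_positions)


# Hypothetical start coordinates on one chromosome, for illustration only
example_positions = [10_000, 240_000, 250_000, 610_000, 740_000]
counts = bin_counts(example_positions)

for bin_index in sorted(counts):
    print(bin_index * BIN_SIZE, counts[bin_index])
```

Running the same function separately on the EST-matched subset and on the remainder would give the two values stacked in each bar.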
Figure 3: Disablements, length and composition for YG.
(b) Length distribution of pseudogene matches. The distribution of pseudogene match lengths (in nucleotides) is shown as a dashed line, and that of worm gene exon lengths as a solid line. The lengths of the Sanger Centre annotated genes are not included, as these are more carefully parsed predictions arising from a gene prediction algorithm. Each point n denotes the count of exons or matches for an interval from n to 50-n. Every fourth point is indicated on the x-axis.
(c) Composition for YG. The amino-acid composition of the Wormpep18 database is compared with the implied amino-acid composition of random non-repetitive genomic sequence and of the YG population. The percentage composition for each of the twenty amino acids is graphed in decreasing order of the implied amino-acid composition in the pseudogene set. In the bottom part of the figure, the YG difference for each amino acid is indicated by a bar. This is defined as (|w-p| + |p-r|) / p, where w is the amino-acid composition value for the Wormpep18 proteins, r is the implied composition for random genomic sequence and p is the implied pseudogene composition. The asterisk (*) in this graph represents the termination codons. The number of codons for each amino-acid type is written below the one-letter code for the residue.
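As a worked illustration of the YG difference defined above, the following minimal sketch evaluates (|w-p| + |p-r|) / p for a single amino acid. The function name `yg_difference` and the input percentages are hypothetical and are not taken from the figure; only the formula itself comes from the legend.

```python
def yg_difference(w, p, r):
    """Return (|w - p| + |p - r|) / p, the formula given in the legend.

    w: composition in Wormpep18 proteins (percent)
    p: implied composition in the pseudogene set (percent)
    r: implied composition in random genomic sequence (percent)
    """
    if p == 0:
        raise ValueError("p must be non-zero for the ratio to be defined")
    return (abs(w - p) + abs(p - r)) / p


# Hypothetical percentages for one amino acid, for illustration only
example = yg_difference(w=6.0, p=4.0, r=3.0)
print(example)  # (2.0 + 1.0) / 4.0 = 0.75
```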
Plot of the number of genes in a paralog family (Gfamily) versus the number of pseudogenes in a paralog family (YGfamily). The families from the GE set are marked as grey filled points, with the remainder as unfilled points. The lines indicate the overall ratio of the number of genes to the number of pseudogenes for the whole genome and for the GE subset. Families with large numbers of genes and/or pseudogenes are labelled with the name of their family representative.
The folds and pseudofolds in the worm genome. The SCOP domain matches (part (a) of the figure) are extrapolated onto Wormpep18 from assignments made previously on Wormpep17 proteins. Pseudofold assignments (part (b)) are taken from the closest matching gene paralog for each pseudogene. The columns are as follows: rank for folds or pseudofolds (with total numbers in brackets); corresponding rank for pseudofolds or folds; a fold cartoon; the representative domain; the SCOP 1.39 domain number; and a brief description of the fold. The fold cartoons are coloured in a sliding gradient from blue for the N-terminus to red for the C-terminus.