A bioinformatic analysis of upstream
regulatory sequences of glycolysis/gluconeogenesis pathway genes of yeast.
Anandasankar Ray
The recent progress in DNA sequencing has made available to researchers the
sequences of a large number of genomes. Most of the bioinformatic analysis
since then has focussed on coding sequences. Protein function, however, is also
regulated through transcriptional control of gene expression, which
is mediated by transcription factors acting on the non-coding cis-regulatory DNA
sequences of a particular gene. The question I would like to address is how to
extract motifs that are shared by upstream sequences, and that may therefore be
responsible for co-regulation of a group of genes: in this study, metabolic
enzymes in yeast.
Typically, researchers search for known regulatory patterns (e.g. the
consensus for a transcription factor) in the upstream regions of genes and look
for correlation with the expression profile. Such methods are heuristic and are
limited to transcription factor binding sites that are already known and to
candidate factors already identified as possible regulators. Typical
programs are MATINSPECTOR and PATTERN SEARCH, which use databases like TRANSFAC
to match known binding sites to non-coding DNA upstream of a gene.
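As an illustration of this kind of known-site search, the following minimal sketch scans an upstream sequence for a consensus written with IUPAC ambiguity codes, on both strands. It is not the algorithm of MATINSPECTOR or PATTERN SEARCH; the function names and the toy sequences in the usage note are assumptions made for the example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus (e.g. 'CAYGTG') into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(upstream, consensus):
    """Return sorted (position, strand, matched_text) tuples.

    Positions are 0-based forward-strand coordinates of the leftmost base
    of the site. A lookahead is used so overlapping matches are reported.
    Palindromic sites are reported once per strand.
    """
    seq = upstream.upper()
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        # Convert the reverse-strand offset back to forward coordinates.
        start = len(seq) - (m.start() + len(m.group(1)))
        hits.append((start, "-", m.group(1)))
    return sorted(hits)
```

For example, `find_sites("AAACCCCCTTT", "AGGGGG")` reports a single site on the reverse strand, because CCCCCT is the reverse complement of AGGGGG.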
In recent years, however, computer programs have been developed that can
identify potential regulatory sequences from within a group of co-regulated
genes. The two programs I have used for my analysis are discussed in more
detail below.
The aim of this study is to detect unknown regulatory sites in a set of
potentially co-regulated metabolic pathway genes.
The crucial issue for the success of any pattern discovery program is to select
gene families that are likely to be co-regulated by common sets of transcription
factors. Enzymes that belong to the same metabolic pathway can be expected to be
present at the same time and in the same place, so that they can act
sequentially and maintain the flux of product formation. Their
expression may therefore be regulated in a similar manner, with their cis-regulatory
DNA carrying similar transcription factor binding sites. In my analysis I have
used the 800 bp upstream of each gene of the glycolysis and gluconeogenesis
metabolic pathways of S. cerevisiae (see Figure 1).
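The extraction of an upstream region can be sketched as follows. This is only an illustration: the coordinate convention (0-based, half-open, given on the forward strand) and the function names are assumptions, and the sequences used in this study were not necessarily obtained this way.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome, start, end, strand, length=800):
    """Return up to `length` bases immediately 5' of a coding region.

    `start` and `end` are 0-based, half-open forward-strand coordinates
    of the coding region (an assumed convention). For a gene on the
    reverse strand the upstream region lies after `end` on the forward
    strand and is returned reverse-complemented, so the result always
    reads 5' to 3' towards the start codon. The region is truncated at
    the chromosome ends.
    """
    if strand == "+":
        return chromosome[max(0, start - length):start]
    if strand == "-":
        return reverse_complement(chromosome[end:end + length])
    raise ValueError("strand must be '+' or '-'")
```

With the default `length=800` this returns the 800 bp used here, or fewer when the gene lies closer than 800 bp to a chromosome end.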
I have used two programs: the word detector OLIGO-ANALYSIS and the
DYAD-DETECTOR (van Helden et al., 1998; van Helden et al., 2000a; reviewed in
van Helden et al., 2000b). The OLIGO-ANALYSIS method detects
over-represented oligonucleotides: the statistical significance of each
oligonucleotide is assessed by comparing its number of occurrences in the
upstream sequences with its frequency in all
non-coding sequences of the genome. In contrast to the heuristic methods listed
above, this analysis is rigorous and exhaustive. The method has been optimized
for use with the yeast genomic sequence database. The efficacy of the program
has been demonstrated in S. cerevisiae: differential expression
data obtained from microarray analysis under different conditions of nitrogen
metabolism were used to predict potential regulatory motifs, and most known
transcription factors involved in the process could be mapped onto the
differentially regulated genes (van Helden et al., 1998).
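The idea of scoring over-represented words can be sketched as below. This is a simplified illustration and not the OLIGO-ANALYSIS implementation: it assumes a binomial model, counts a single strand, applies no multiple-testing correction, and takes the background frequencies from a caller-supplied dictionary (`background_freq`, a hypothetical name) that would be estimated from all non-coding sequences of the genome.

```python
import math
from collections import Counter


def count_oligos(sequences, k=6):
    """Count overlapping k-mers; return (counts, number of positions scanned)."""
    counts = Counter()
    positions = 0
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words with ambiguous bases
                counts[word] += 1
                positions += 1
    return counts, positions


def binomial_tail(k_obs, n, p):
    """P(X >= k_obs) for X ~ Binomial(n, p), summed in log space."""
    if k_obs <= 0:
        return 1.0
    total = 0.0
    for k in range(k_obs, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def over_represented(sequences, background_freq, k=6, max_p=0.001):
    """Return (word, occurrences, p_value) sorted by increasing p_value.

    `background_freq` maps each k-mer to its expected frequency per
    position; words missing from it are ignored. `max_p` is an arbitrary
    illustrative threshold.
    """
    counts, positions = count_oligos(sequences, k)
    results = []
    for word, observed in counts.items():
        p = background_freq.get(word)
        if p is None or not 0.0 < p < 1.0:
            continue
        p_value = binomial_tail(observed, positions, p)
        if p_value <= max_p:
            results.append((word, observed, p_value))
    return sorted(results, key=lambda r: r[2])
```

Called with the 800 bp upstream sequences of a gene group and genome-wide hexanucleotide frequencies, `over_represented` would list candidate hexanucleotides, most significant first.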
DYAD-ANALYSIS is designed to identify motifs that consist of a pair of trinucleotides
separated by a non-conserved region of variable width. This method is efficient for
the detection of binding sites of zinc finger proteins and other transcription
factors. It has been shown to detect regulatory patterns in gene
clusters from DNA chip experiments on genes whose expression fluctuates with
the cell cycle in yeast (van Helden et al., 2000a).
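The notion of a spaced dyad can be made concrete with a short counting sketch. It only enumerates and counts dyads on one strand and does not reproduce the statistics of the published program; the spacing range of 0 to 16 and the function name are assumptions chosen for the example. Dyads are written in the notation used in Table 1, for instance GGAN{3}CAC.

```python
from collections import Counter


def count_dyads(sequences, min_spacing=0, max_spacing=16):
    """Count every trinucleotide pair separated by a fixed-width spacer.

    Keys look like 'GGAN{3}CAC': two conserved trinucleotides separated
    by `spacing` unconstrained bases. The default spacing range is an
    illustrative assumption.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for spacing in range(min_spacing, max_spacing + 1):
            span = 3 + spacing + 3  # total width of the dyad
            for i in range(len(seq) - span + 1):
                left = seq[i:i + 3]
                right = seq[i + 3 + spacing:i + span]
                if set(left + right) <= set("ACGT"):
                    counts[f"{left}N{{{spacing}}}{right}"] += 1
    return counts
```

The resulting counts could then be compared with background dyad frequencies in the same way as for single words.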
Both programs are available through a web-based interface and can be
used either alone or in sequence with a feature-map program that automatically
generates a visual representation of the positions at which patterns were found
(http://copan.cifn.unam.mx/~jvanheld/rsa-tools/).
RESULTS
A total of 15 potential regulatory sites were found in the upstream regions of
the 22 genes analyzed. Ten of these were hexanucleotide motifs detected by the
OLIGO-ANALYSIS program, which may be recognized by different yeast transcription
factors, while the other five were spaced dyads detected by the DYAD-DETECTOR
program, which could be binding sites for zinc finger transcription factors. The
distribution of the potential regulatory elements is shown in Figures 2
and 3.
A summary of the results obtained is given below. This analysis shows that
genes whose products take part in the common metabolic pathway of
glycolysis/gluconeogenesis share very similar potential regulatory sites within
the 800 bp upstream of their coding regions. Identifying such potential regulatory
sites can help to identify not only the cis-elements important for transcriptional
regulation of the group of genes, but also the transcription
factors that may play a role in that regulation. Experimental evidence will
ultimately be needed to show that some or all of these sites are indeed involved in
regulation.
TABLE 1. Frequency of occurrence of potential regulatory sites.
Gene Name | CAGGGG | ACGTGG | AGGGGG | CCCTGA | ACAGGG | CCCACG | CCCCAC | AGGAAG | CACGTG | GGAN{3}CAC | CAGN{6}GGG | GGAN{1}GCC |
PGM1      | 1      | 1      | 1      | 1      | 1      | 0      | 0      | 0      | 0      | 0          | 0          | 0          |
PGM2      | 3      | 1      | 2      | 1      | 2      | 2      | 1      | 0      | 0      | 1          | 1          | 0          |
PGI1      | 0      | 1      | 0      | 0      | 0      | 0      | 0      | 2      | 0      | 1          | 0          | 1          |
HXK1      | 2      | 0      | 0      | 1      | 0      | 1      | 1      | 1      | 1      | 0          | 2          | 0          |
HXK2      | 0      | 0      | 0      | 1      | 0      | 0      | 0      | 0      | 0      | 2          | 0          | 0          |
GLK1      | 1      | 0      | 2      | 1      | 1      | 0      | 0      | 0      | 2      | 1          | 0          | 0          |
GAL10     | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 0      | 0      | 0          | 0          | 0          |
FBP1      | 0      | 0      | 0      | 0      | 1      | 0      | 1      | 1      | 0      | 0          | 1          | 0          |
PFK1      | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 0      | 0          | 0          | 0          |
PFK2      | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 0      | 1          | 0          | 2          |
FBA1      | 0      | 2      | 0      | 0      | 0      | 1      | 0      | 1      | 0      | 1          | 0          | 1          |
TDH1      | 1      | 2      | 2      | 1      | 0      | 1      | 1      | 2      | 1      | 0          | 1          | 0          |
TDH2      | 0      | 1      | 0      | 0      | 0      | 0      | 0      | 1      | 1      | 1          | 0          | 1          |
TDH3      | 1      | 0      | 0      | 1      | 1      | 1      | 0      | 0      | 0      | 0          | 1          | 3          |
PGK1      | 0      | 1      | 0      | 1      | 0      | 0      | 0      | 4      | 1      | 1          | 0          | 0          |
GPM1      | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 3      | 0      | 1          | 1          | 0          |
GPM2      | 0      | 1      | 0      | 0      | 0      | 1      | 1      | 0      | 0      | 1          | 0          | 2          |
GPM3      | 0      | 0      | 0      | 1      | 2      | 0      | 1      | 0      | 0      | 0          | 0          | 0          |
END1      | 1      | 0      | 0      | 1      | 0      | 0      | 0      | 1      | 0      | 1          | 2          | 0          |
END2      | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 0      | 1          | 0          | 1          |
CDC19     | 0      | 2      | 0      | 0      | 0      | 1      | 0      | 1      | 1      | 2          | 0          | 1          |
PYK2      | 0      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1          | 1          | 1          |
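
To see how widely each site is shared, the counts of Table 1 can be tallied per motif. The sketch below assumes the table is held as pipe-delimited text with one row per gene (an assumption about layout); the function names are hypothetical, and the embedded excerpt contains only the first three gene rows of Table 1.

```python
def parse_count_table(text):
    """Parse pipe-delimited rows into (motifs, {gene: {motif: count}})."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = [c.strip() for c in lines[0].strip().strip("|").split("|")]
    motifs = header[1:]  # first column holds the gene name
    table = {}
    for line in lines[1:]:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        values = [int(v) for v in cells[1:]]
        if len(values) != len(motifs):
            raise ValueError(f"row {cells[0]!r} has {len(values)} counts")
        table[cells[0]] = dict(zip(motifs, values))
    return motifs, table


def genes_per_motif(motifs, table):
    """Number of genes with at least one occurrence of each motif."""
    return {m: sum(1 for counts in table.values() if counts[m] > 0)
            for m in motifs}


# Excerpt: first three gene rows of Table 1 only.
EXCERPT = """
Gene Name | CAGGGG | ACGTGG | AGGGGG | CCCTGA | ACAGGG | CCCACG | CCCCAC | AGGAAG | CACGTG | GGAN{3}CAC | CAGN{6}GGG | GGAN{1}GCC |
PGM1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
PGM2 | 3 | 1 | 2 | 1 | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 0 |
PGI1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 |
"""

motifs, table = parse_count_table(EXCERPT)
summary = genes_per_motif(motifs, table)
```

Applied to all 22 rows, the same tally gives the number of genes carrying each potential regulatory site.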