© 1997 Oxford University Press 361-362

Computational analysis of transcriptional regulatory elements: a field in flux

P. Bucher, J. W. Fickett and A. Hatzigeorgiou

Sequence analytic methods have played a role in the understanding of transcriptional regulation for many years (for example, an alignment of E. coli promoter regions showing conserved upstream regions was reported by Pribnow in 1975). In recent years there has been a tremendous increase in experimental work aimed at understanding the fundamental biochemistry of transcription initiation as well as the mechanisms that regulate gene expression at the level of transcription. There is currently great interest in developing new computational methods as well. This interest is driven partly by the new biological understanding (and will hopefully contribute to it, by the use of quantitative models), and partly by the need to efficiently analyse newly determined genomic sequences.

Computational biologists interested in transcription have made good progress in the last few years. For example, an improved ability to describe the DNA binding specificity of proteins involved in transcription lies at the foundation of much of the mathematical modeling in the field. It has been generally recognized that consensus sequences are usually inadequate to describe DNA-binding specificity, and it is now most common to describe the binding sites of a particular protein as the set of sequences scoring above a particular threshold with a Positional Weight Matrix (PWM). Considerable theoretical work, and some experimental effort, have gone into the development of algorithms to find a PWM from known binding sites, understand to what extent the description of specificity by means of a PWM is valid, and record PWMs for particular proteins.
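The threshold-based description of binding sites can be made concrete with a small sketch. The code below is illustrative only and is not taken from any of the workshop papers: the function names (`build_pwm`, `score`, `scan`), the log-odds scoring with a pseudocount, the uniform background of 0.25 per base, the toy sites and the threshold of 5.0 are all assumptions chosen for the example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned sites of equal length.

    Assumptions for this sketch: a flat pseudocount per base and a
    uniform background frequency; real applications may use neither.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(width):
        column = [s[i] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position weights of the bases in `window`."""
    return sum(col[base] for col, base in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Return (start, score) for every window scoring at or above threshold.

    The sequence is assumed to contain only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(pwm, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Toy aligned sites, invented for illustration (loosely TATA-like).
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
hits = scan(pwm, "GGCGTATAATGCCG", threshold=5.0)
```

In this sketch the "binding sites" of the hypothetical protein are exactly the windows returned by `scan`; moving the threshold trades sensitivity against specificity, which is why the choice of threshold matters as much as the matrix itself.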

Other important areas of progress include computer programs for the recognition of eukaryotic promoters that, for the first time, have error rates low enough to be of practical interest, and a number of recently developed data collections that are either more complete or more consistent than what is available in the primary sequence databases.

It may also be counted as progress that some early errors of the field have now been corrected, so that (1) it is now widely recognized that transcriptional regulation is exceedingly complex, and that algorithms must take into account alternative pathways and the synergism of multiple transcription factors, and (2) cross-validation techniques are now commonly employed in the benchmarking of new algorithms for functional prediction.

We feel that one of the main needs in the field is simply for better communication. Experimentalists often do not take advantage of the best computational techniques; algorithm developers often base their methods on an overly simplified view of the biology; computer scientists do not use the best data collections; and mathematicians sometimes show an aversion to learning about powerful machine learning techniques. In order to promote communication and collaboration, and to assess the state of the art, we organized the first International Workshop on Computational Analysis of Eukaryotic Transcriptional Regulatory Elements, at the Deutsches Krebsforschungszentrum in Heidelberg, in January of 1996. We were very pleased to have a highly interdisciplinary meeting of about 70 people, with participation from sequence analysts, pure experimentalists, computer scientists, researchers working on the nucleosome positioning problem, microscopists using computers for image processing, structural biologists interested in the 3D structure of promoters and gene regulatory proteins, and experts from the neighbouring field of prokaryotic gene transcription. It was particularly encouraging that there were several presentations at the meeting by groups comprising both experimental and computational biologists, and that communication between the experimental and the computational side seemed to be excellent. The interdisciplinary nature of the meeting also helped to focus attention on the primary scientific goal that all participants share: to understand, by modelling and model-testing, the transcription initiation event and its use in the regulation of gene expression.

It remains a controversial issue whether function can be determined from the DNA sequence, at the level of symbol manipulation, without reference to 3D structure or other more biological representations of the data. (Most sequence analysis developers tacitly assume that such is possible, although it may be unwise to do so.) What is clear is that the work of a person in any one discipline will be much more effective if he or she is willing to understand and make use of the results of related disciplines. Experience to date suggests that in the analysis of transcriptional regulatory elements, more than in other domains of sequence analysis, a multi-disciplinary approach will be extremely important.

Several important, difficult, and yet approachable challenges lie before us: (1) improvement of databases on gene transcription. Although important progress has been reported, it is also clear that the databases remain highly incomplete, both in literature coverage and in the kinds of information represented. In part this can be solved by funding agencies and databank teams. But in part it is due to shortcomings of the printed literature, and it will require the cooperation of the community as a whole to really improve the situation. (2) Use of the best possible computational tools by bench biologists. Currently, the best tools are often not known by, or not available to, bench biologists. An effort will be required from both sides, and perhaps from commercial software developers as well, to make progress here. (3) Integration of gene identification and promoter recognition tools. The current generation of software tools for gene identification has very rudimentary recognition of the beginning and end of genes. This was less of a problem when investigators typically sequenced the region of a single gene of interest. With large scale genomic sequencing, however, the inability of the algorithms to separate the exons of one gene from those of another has become a serious limitation. (4) Prediction of gene function via transcriptional context. To date, most attempts to determine gene function from sequence have focussed on the putative protein sequence. However the transcriptional context of a gene also provides important clues to function. Thus the ability to recognize algorithmically the set of genes transcribed in a particular tissue, or under other specific conditions, would be an important advance towards the determination of gene function. (5) Improvement of methods to describe the DNA-binding specificity of proteins. The PWM is a linear separation method. In some cases it works well, but in others it does not. It will probably be important to apply non-linear methods of separation (and perhaps develop new ones) for this problem.
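The sense in which the PWM is a linear separation method can be spelled out in a short derivation. The notation is introduced here for illustration and is not taken from the workshop papers: x is a candidate site of length L, W is the 4 x L weight matrix, T is the score threshold, and φ(x) is the 0/1 indicator ("one-hot") encoding of x with 4L components.

```latex
% Notation assumed for this sketch: x = x_1 ... x_L, W is 4 x L, T is a threshold,
% delta(x_i, b) = 1 if x_i = b and 0 otherwise, phi(x) stacks these indicators.
\[
  S(x) \;=\; \sum_{i=1}^{L} W_{x_i,\,i}
       \;=\; \sum_{i=1}^{L} \sum_{b \in \{A,C,G,T\}} W_{b,i}\,\delta(x_i, b)
       \;=\; \mathbf{w} \cdot \boldsymbol{\phi}(x),
\]
\[
  x \text{ is classified as a site} \iff \mathbf{w} \cdot \boldsymbol{\phi}(x) \ge T .
\]
```

The decision boundary is therefore a hyperplane in the encoded space, and each position contributes to the score independently of the others. A dependence between positions (for example, a base at one position being tolerated only when a particular base occurs at another) cannot be expressed in this form, which is one reason a non-linear method might be needed.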

Much enthusiasm for the future of this field was evinced at the workshop described above. Only a small fraction of the new results given in the 30 oral presentations and 10 posters of the meeting can be presented in this volume. We hope that this collection of nine papers will whet the appetite of many biologists, in many disciplines, to contribute to a deeper understanding of transcription and its regulation, and that a second workshop will show as much progress as the first.

Philipp Bucher

James W. Fickett

Artemis Hatzigeorgiou

A short introductory reading list

This list is by no means comprehensive, and often contains one representative paper where several others could just as well have been cited. We apologize in advance to those whose work has not been mentioned.

Early work on E. coli promoter recognition:

Harr,R., Häggström,M. and Gustafsson,P. (1983) Search algorithm for pattern match analysis of nucleic acid sequences. Nucleic Acids Res., 11, 2943-2957.

Mulligan,M.E., Hawley,D.K., Entriken,R. and McClure,W.R. (1984) Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity. Nucleic Acids Res., 12, 789-800.

Pribnow,D. (1975) Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc. Natl Acad. Sci. USA, 72, 784

Databases:

Ghosh,D. (1990) A relational database of transcription factors. Nucleic Acids Res., 18, 1749-1756.

Knueppel,R., Dietze,P., Lehnberg,W., Frech,K. and Wingender,E. (1994) TRANSFAC retrieval program: A network model database of eukaryotic transcription regulating sequences and proteins. J. Comp. Biol., 1, 191-198.

Weight matrices, other sequence analysis methods:

Waterman,M.S., Arratia,R. and Galas,D.J. (1984) Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46, 515-527.

Staden,R. (1984) Graphic methods to determine the function of nucleic acid sequences. Nucleic Acids Res., 12, 521-538.

Berg,O.G. and von Hippel,P.H. (1987) Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol., 193, 723-750.

Stormo,G.D. and Hartzell,G.W.III (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 86, 1183-1187.

Lawrence,C.E. and Reilly,A.A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7, 41-51.

Transcriptional specificity:

Tjian,R. (1995) Molecular Machines that Control Genes. Scientific American, Feb 95, 54-61.

Fickett,J.W. (1996) Coordinate Positioning of MEF2 and Myogenin Sites. GeneCOMBIS http://www.elsevier.nl/locate/genecombis; and Gene, 172, GC19-GC32.

Promoter prediction:

Bucher,P. (1990) Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol., 212, 563-578.

Prestridge,D.S. (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol., 249(5), 923-932.

Staden,R. (1988) Methods to define and locate patterns of motifs in sequences. Comput. Appl. Biosci., 4, 53-60.

Reviews:

Stormo,G.D. (1988) Computer methods for analyzing sequence recognition of nucleic acids. Annu. Rev. Biophys. Biophys. Chem., 17, 241-263.

Staden,R. (1990) Searching for patterns in protein and nucleic acid sequences. Methods Enzymol., 183, 193-211.

