Thaddeus Pollock 12/10/99

MB&B 452a Mark Gerstein

 

Microarray Gene Expression Data Analysis for Drug Discovery

 

A large investment in DNA sequencing over the past decade has produced an explosion of sequence data. This data typically consists of complete genome sequences for organisms, along with large collections of expressed sequence tags (ESTs) representing individual gene mRNA products. In the next two to three years, the complete sequence of the human genome is expected to be determined. However, new genome analysis approaches must be undertaken to gain a working knowledge of the physiological roles of the estimated 100,000 genes, their implications in disease development, interactions with new drugs, functions in normal development, and variations within a population. One such novel approach is the study of gene expression on a genomic scale. For this kind of data mining to be carried out efficiently, specialized programs (and algorithms) must be developed and implemented. (1), (2)

Two types of information are usually added to raw sequence data: functional predictions and basic information about expression patterns. The function of a gene is usually predicted from sequence homology with other known proteins, found by a database search with a tool such as FASTA or BLAST. Gene expression levels are measured by one of three high-throughput methods: enumeration of ESTs from cDNA libraries, differential display determination of gene fragments, or hybridization to oligonucleotide microarrays. DNA microarrays are particularly effective because they allow the expression levels of thousands of genes to be measured quickly and in parallel, making whole genome characterization possible. (3) (4)

Obtaining many expression measurements is relatively easy compared to the effort necessary to verify, standardize, handle, and interpret the huge amount of data that is collected. Thousands of unique spots of oligonucleotides are applied to glass microscope slides by physical or chemical means, a fluorescently labeled probe is applied and washed, and the hybridized probe is detected by microscope or scanner. Thus the identity of each spot can be correlated with the presence of a signal (i.e. the degree of hybridization). Good data can only be obtained if the starting material is of high quality, positive and negative controls are used, and equipment conditions maximize the signal-to-noise ratio and dynamic range. (5)

DNA microarray technology has many features that make it particularly well suited for genome analysis by gene expression surveys. It is fairly cheap, about $60,000 for an arrayer and scanner and approximately $20 per microarray copy. It can be applied to many tasks and is universal. It is fast, on the order of 150 copies of a 12,000 gene array per day, and it is fairly easy to operate. Affymetrix, a major player in oligonucleotide chip microarrays, commercially produces customized chips priced in the thousand-dollar range. (6) (7)


Data management is an important issue. A well-organized and functional software system must allow the quick conversion of raw data to normalized expression values. Array target segmentation, background intensity extraction, target detection, target intensity extraction, ratio analysis via a ratio distribution function, and multiple image analysis must all be carried out. Distribution parameters are typically estimated using a maximum likelihood method, and ratios are calibrated using an iterative method, allowing a p-value to be associated with each ratio measurement. (9)
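The step from raw intensities to normalized ratios can be illustrated with a minimal sketch in Python. It assumes a two-channel array in which each spot has a foreground and a local background intensity per channel. The field names (red_fg, red_bg, green_fg, green_bg) are hypothetical, and simple median centering of log ratios stands in for the maximum likelihood and iterative calibration described above, which is not reproduced here.

```python
import math
from statistics import median


def background_corrected(foreground, background, floor=1.0):
    """Subtract local background from a spot intensity.

    The result is clamped at a small positive floor (an assumption of this
    sketch) so that the logarithm below is always defined.
    """
    return max(foreground - background, floor)


def normalized_log_ratios(spots):
    """Return {gene: log2(red/green)} centered on the median log ratio.

    `spots` is a list of dicts with the hypothetical keys
    gene, red_fg, red_bg, green_fg, green_bg.
    """
    raw = {}
    for spot in spots:
        red = background_corrected(spot["red_fg"], spot["red_bg"])
        green = background_corrected(spot["green_fg"], spot["green_bg"])
        raw[spot["gene"]] = math.log2(red / green)

    # Median centering: assumes most genes are not differentially expressed,
    # so the typical log ratio should be zero after normalization.
    center = median(raw.values())
    return {gene: value - center for gene, value in raw.items()}
```

Median centering is only one of several possible normalizations. It is used here because it needs no distributional fit, and it does not attach a p-value to each ratio as the calibrated approach does.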

Furthermore, novel and easy-to-understand representational and statistical tools must allow the researcher to view the data critically and from crucial perspectives. The GATC consortium, led by Affymetrix and Molecular Dynamics, has proposed such a tool set (its own, in fact) to be used as an industry standard. The NIH’s NHGRI Microarray Project has also developed a software package known as Array Suite. (6) (8) (9)

Accurate and organized data storage is also crucial. Typically, experiments with high density arrays create large amounts of data, which have been managed as a collection of Excel and high-resolution image files. Each array hybridization provides numerical data on the expression level of every gene on the array, possibly along with associated textual information giving clone names and raw data, and graphical data showing emission channels. Thus, the database used must have a standardized correction and normalization feature allowing all data points to be compared easily, even across different projects. Affymetrix has developed a complete database software package known as GeneChip (LIMS, Expression Data Mining Tool, and GeneChip Suite) and has recently teamed up with Incyte Pharmaceutical’s LIFESEQ database and bioinformatics analysis system. Furthermore, government centers like the National Human Genome Research Institute are developing information systems such as ArrayDB. These new relational databases allow extra flexibility in the nature and structure of data input and are capable of linking array target sequences with retrieval systems like NCBI’s Entrez, UniGene and KEGG. (8) (10) (11) (12) (13)
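As a rough illustration of how a relational layout makes measurements from different projects comparable, the following Python sketch uses the standard sqlite3 module. The table names, column names, and sample rows are all hypothetical and invented for this example. This is not the schema of ArrayDB or of the GeneChip software.

```python
import sqlite3

# Hypothetical three-table layout: array targets, hybridizations, and one
# normalized measurement per (hybridization, target) pair.
SCHEMA = """
CREATE TABLE target (
    target_id   INTEGER PRIMARY KEY,
    clone_name  TEXT NOT NULL,
    external_id TEXT              -- e.g. a key into an outside retrieval system
);
CREATE TABLE hybridization (
    hyb_id     INTEGER PRIMARY KEY,
    project    TEXT NOT NULL,
    image_file TEXT               -- path to the scanned image
);
CREATE TABLE measurement (
    hyb_id           INTEGER NOT NULL REFERENCES hybridization(hyb_id),
    target_id        INTEGER NOT NULL REFERENCES target(target_id),
    normalized_value REAL NOT NULL,
    PRIMARY KEY (hyb_id, target_id)
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO target VALUES (1, 'clone_0001', 'placeholder_id')")
    conn.executemany(
        "INSERT INTO hybridization VALUES (?, ?, ?)",
        [(1, "project_a", "a_scan.tif"), (2, "project_b", "b_scan.tif")],
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?)",
        [(1, 1, 1.5), (2, 1, -0.5)],
    )
    return conn


def values_for_clone(conn, clone_name):
    """Return (project, normalized_value) rows for one clone, across projects."""
    query = """
        SELECT h.project, m.normalized_value
        FROM measurement AS m
        JOIN target AS t ON t.target_id = m.target_id
        JOIN hybridization AS h ON h.hyb_id = m.hyb_id
        WHERE t.clone_name = ?
        ORDER BY h.project
    """
    return conn.execute(query, (clone_name,)).fetchall()
```

Calling values_for_clone(build_demo_db(), "clone_0001") gathers the same clone's values from both made-up projects in a single query, which is the kind of cross-project comparison that loose spreadsheet and image files make difficult.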

Until recently, cDNA microarrays have been used mostly to examine differential gene expression in eukaryotic cells. However, the use of microarrays to collect vast amounts of data has shifted the bottleneck from data acquisition to data analysis. Unfortunately, since microarrays are a relatively new technology, they are unavailable to many researchers, and as a result the development of computational software capable of handling vast and diverse quantities of data has only begun. Undoubtedly, the need for novel data analysis techniques, using the computational power of today’s microprocessors, will be fueled by pharmaceutical companies.

References:
(1) Regalado, A. (1999) Mining the Genome. http://www.techreview.com/articles/oct99/regalado.htm

(2) Johnston, M. (1998) Gene chips: array of hope for understanding gene regulation. Curr Biol 8(5): R171-R174.

(3) Venter, J.C., M.D. Adams, G.C. Sutton, A.R. Kerlavage, H.O. Smith, M. Hunkapiller. (1998) Shotgun sequencing of the human genome. Science 280(5368): 1540-1542.

(4) http://www.ncbi.nlm.nih.gov/BLAST/

(5) Southern E., K.M. Shchepinov. (1999) Molecular interaction on microarrays. Nature Genetics Supplement 21: 5-9.

(6) http://www.genechip.com

(7) Lipschutz, R.J., S.P.A. Fodor, T.R. Gingeras, D.J. Lockhart. (1999) High density synthetic oligonucleotide arrays. Nature Genetics Supplement 21:20-24.

(8) Granjeaud, S., F. Bertucci, B.R. Jordan. (1999) Expression profiling: DNA arrays in many guises. BioEssays, 21:781-790.

(9) NHGRI Microarray Project, http://www.nhgri.nih.gov/

(10) http://www.genemachines.com

(11) http://www.affymetrix.com

(12) http://www.affx.com/press/incyte_jake.html

(13) Ermolaeva, O., M. Rastogi, K.D. Pruitt, G.D. Schuler, M.L. Bittner, Y. Chen, R. Simon, P. Meltzer, J.M. Trent, M.S. Boguski. (1998) Data management and analysis for gene expression arrays. Nat Genet 20(1): 19-23.