Genome-wide expression analysis: A key for unlocking the mysteries of biological functions

 

With the advent of highly efficient, high-throughput sequencing technologies and the resulting deluge of sequence data, including numerous whole-genome sequences (see http://www.ncbi.nlm.nih.gov/Genomes/index.html), the inevitable question arises: what can all this “raw” data tell us about how biological systems function?  One technology being developed to help answer this question is large-scale gene expression analysis using DNA microarrays.  With this type of analysis, researchers can monitor expression across an entire genome or cell type and learn what roles different genes play, how they are regulated, and how expression levels differ among cell types or states.

 

            Two main types of DNA microarrays are used in gene expression analysis; they differ in how they are produced and applied.  The first type, developed by Affymetrix, Inc., uses photolithography to synthesize DNA oligonucleotides directly on a glass slide, producing a very high-density array [1].  The key limitation of this method is that the sequences of the oligonucleotides on the slide must be known in advance.  The other type uses ink-jet or other spotting methods to apply cDNA probes directly to the glass slide [2].  In contrast to the Affymetrix chips, the sequence of the sample need not be known, but this method has other potential limitations: the amount of sample available, the purity of the sample, and the lower density of the arrays.

 

            Microarrays produced by either method are then hybridized to fluorescently labeled mRNA.  Laser scanning microscopy is used to detect the probes on the array that have hybridized to the labeled mRNA.  The signals can be quantified, and redundant probes (different sequences from the same gene) are used to improve the signal-to-noise ratio [1].  One of the most common experimental designs using this technology compares two different samples (e.g. cells grown under two different conditions, or diseased and normal cells).  RNA from the two samples is labeled with two different fluorescent labels, and both are hybridized to the same array.  The relative expression of each gene in the two samples can then be inferred from the relative amount of hybridization, as detected by the color and intensity of each spot. 
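
            The two-sample comparison above is often summarized as a log ratio of the two channel intensities for each spot.  The following is a minimal sketch of that arithmetic, not a method taken from the cited papers; the function name, the intensity values, and the floor used to guard against very low signals are all assumptions made for illustration.

```python
import math

def log2_ratios(sample_a, sample_b, floor=1.0):
    """Return log2(a / b) per spot; positive means higher signal in sample A."""
    ratios = []
    for a, b in zip(sample_a, sample_b):
        # Clamp very low intensities so the ratio stays defined (assumed floor).
        a, b = max(a, floor), max(b, floor)
        ratios.append(math.log2(a / b))
    return ratios

# Hypothetical background-corrected intensities for four spots.
treated = [1200.0, 300.0, 800.0, 50.0]
control = [300.0, 300.0, 1600.0, 400.0]

for spot, ratio in enumerate(log2_ratios(treated, control)):
    print(f"spot {spot}: log2 ratio = {ratio:+.1f}")
```

            On a log2 scale, a two-fold increase and a two-fold decrease are symmetric about zero (+1 and -1), which makes induced and repressed genes easier to compare.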

 

            By measuring the transcription levels of genes in this way under various growth conditions, at different developmental stages, in different tissues, or in the presence of different drugs, it is possible to build gene expression profiles characteristic of each sample.  The difficulty lies in extracting meaning from these profiles, which requires organizing the data into a format that allows direct comparisons.  Various methods have been developed for this type of analysis.  To begin, one can imagine the data organized into a table or matrix, with rows representing genes, columns representing samples (e.g. various tissues, growth conditions, etc.), and each cell containing a number representing the expression level of that gene in that sample.  This matrix can then be analyzed by comparing the similarities and/or differences between rows or between columns.  For example, if two rows are similar, one might hypothesize that the corresponding genes are co-regulated and possibly functionally related.  Alternatively, columns can be compared to study the differential expression of the same genes under various conditions and so get an idea of their cellular roles.  Computational tools are used to measure the similarity (or distance) between the two objects (rows or columns of the matrix) being compared.  Various distance measures are used; two of the most common are the Euclidean distance and the Pearson correlation coefficient [3].  Once distances are calculated, the objects can be arranged into clusters or classes by supervised or unsupervised methods. 
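
            The two distance measures named above can be sketched in a few lines.  This is an illustrative example only: the gene names and expression values are invented, and the matrix is stored as a plain dictionary mapping each gene (row) to its expression levels across four hypothetical samples (columns).

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient; undefined for a constant profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression matrix: rows are genes, columns are samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same trend as geneA, twice the level
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend to geneA
}

for other in ("geneB", "geneC"):
    d = euclidean(expression["geneA"], expression[other])
    r = pearson(expression["geneA"], expression[other])
    print(f"geneA vs {other}: Euclidean = {d:.3f}, Pearson r = {r:+.3f}")
```

            The example shows why the choice of measure matters: geneA and geneB are perfectly correlated even though their Euclidean distance is nonzero, because correlation compares the shape of two profiles while Euclidean distance also reflects their absolute levels.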

 

            The supervised approach assumes that additional information is known, such as the functional classes of the genes or the specific states (diseased vs. normal) of the samples.  This information is used to develop classifiers that assign predefined classes to a given expression profile.  The classifiers are trained on one subset of samples and tested on another.  Once the quality of their predictions has been assessed, they are applied to data of unknown classification.  Brown, et al. [4] have applied one type of supervised analysis, support vector machines (SVMs), to the prediction of functional roles of uncharacterized yeast ORFs.
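
            The train-then-test workflow described above can be illustrated with a deliberately simple classifier.  Note that this sketch uses a nearest-centroid rule as a stand-in, not the support vector machines of Brown, et al. [4]; the expression profiles, labels, and function names are hypothetical.

```python
import math

def centroid(profiles):
    """Column-wise mean of a list of expression profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def train(labeled):
    """Build one centroid per class from (profile, label) pairs."""
    by_class = {}
    for profile, label in labeled:
        by_class.setdefault(label, []).append(profile)
    return {label: centroid(ps) for label, ps in by_class.items()}

def predict(model, profile):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], profile))

# Hypothetical labeled profiles (three genes per sample).
training = [
    ([5.0, 1.0, 0.5], "diseased"),
    ([4.5, 1.2, 0.4], "diseased"),
    ([1.0, 4.0, 3.0], "normal"),
    ([1.2, 4.4, 2.8], "normal"),
]
# Held-out samples used only to assess prediction quality.
held_out = [([4.8, 0.9, 0.6], "diseased"), ([0.9, 4.1, 3.2], "normal")]

model = train(training)
correct = sum(predict(model, p) == label for p, label in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

            Only after the held-out accuracy is judged acceptable would such a classifier be applied to profiles of unknown class.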

 

            Unsupervised analysis attempts to cluster the objects into groups with similar properties.  This can be done using hierarchical or non-hierarchical algorithms.  Hierarchical algorithms are similar to those used to construct phylogenetic trees and come in two types: agglomerative and divisive [3].  The agglomerative method is a bottom-up approach in which each object begins as its own cluster and the closest clusters are successively merged until only one is left.  In contrast, the divisive method starts with one cluster containing all objects and successively splits it into smaller ones.  Non-hierarchical algorithms assume that the data can be divided into a specific number of well-separated clusters.  Two types of non-hierarchical analysis are k-means and self-organizing maps (SOMs) [3].  The k-means method defines k points as cluster centers and then assigns each data point to its nearest center.  Each center is then moved to the mean of the points assigned to it, which minimizes its total squared distance to those points.  These two steps are iterated until the centers stop moving.  The SOM method is similar, except that the cluster centers are nodes arranged in a geometric configuration (for example, a grid) that is fixed in advance along with the number of nodes.  Eisen, et al. [5] have developed a hierarchical clustering method that attempts to represent expression profile data in a way that makes identifying meaningful patterns intuitive and efficient.  SOMs have also been applied to analyses of various gene expression data [6,7].
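
            The k-means iteration described above can be sketched as follows.  This is a minimal illustration under stated assumptions rather than a production implementation: the data points are invented two-sample expression profiles, the initial centers are chosen by hand, and Euclidean distance is assumed as the distance measure.

```python
import math

def mean(points):
    """Column-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and re-centering until the centers stop moving."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical profiles forming two well-separated groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
          [8.0, 8.0], [9.0, 9.0], [8.0, 9.5]]
centers, clusters = kmeans(points, centers=[points[0], points[3]])

for center, members in zip(centers, clusters):
    print(f"center {center} has {len(members)} points")
```

            The result of k-means depends on k and on the initial centers, so in practice the algorithm is usually run several times from different starting points.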

 

            Although a number of technical difficulties remain to be worked out in the production of microarrays, the optimization of hybridization, accurate signal reading, and analysis algorithms, these types of expression analyses are already being used to unravel some of the mysteries of how biological systems function.  Primig, et al. [8] have used expression analysis data to define the core meiotic transcriptome in budding yeast.  They compared the meiotic expression patterns of two yeast strains and identified 1600 meiotically regulated genes in each, with a core set of approximately 60% showing similar patterns in both.  Gasch, et al. [9] analyzed the expression patterns of yeast grown under a diverse range of conditions, including temperature shocks, various drugs, hypo- and hyperosmotic shock, starvation, and progression into stationary phase.  They were able to identify a large set of genes that showed a similar response to all conditions, as well as genes whose responses were specific to particular conditions.  Although microarrays covering the entire human genome are not yet available, similar tissue-specific experiments have been done using human cDNA microarrays.  Yano, et al. [10] have done such an experiment to produce a comprehensive gene expression profile of the human renal cortex. 

 

            In addition to comparing expression data among different samples, attempts have been made to identify correlations between the expression data and other biological information such as protein localization [11,12], function, and structure [13].  The goal of these types of analyses is to be able to predict a gene’s function or a protein’s localization based on expression patterns.  Though some correlation among these types of information has been demonstrated, real predictions are difficult, owing mostly to the undefined concept of “function” (e.g. defined in terms of biochemical mechanism, cellular pathways, mutant phenotype, etc.) [14]. 

 

            The medical field has recently begun looking toward large-scale gene expression analyses as a tool for drug discovery and disease classification.  In drug discovery, microarray expression analysis allows identification of targets susceptible to various drugs, as well as optimization of efficacy and reduction of toxicity [2].  It also facilitates identification of the individuals most likely to respond to a particular treatment.  In disease classification, expression profiles are already being used to classify cancers and infectious diseases in order to aid in diagnosis, prognosis, and therapy [2,15].  By understanding how expression profiles relate to various disease states, it becomes possible to distinguish similar diseases more efficiently and reliably, to estimate prognosis better regardless of disease progression, and to verify the efficacy of possible therapies [2].

 

            Large-scale gene expression analysis using DNA microarrays has provided one way in which genomic information can be analyzed to extract information about how biological systems operate.  It can provide basic information about biological processes as well as practical applications such as disease diagnosis and drug discovery.  The technology is still in its infancy, but it has already shown that it has the potential to help unlock the many mysteries of biological function.

 

References:

 

[1] Lipshutz, R.J., Fodor, S.P.A., Gingeras, T.R., Lockhart, D.J.  High density synthetic oligonucleotide arrays.  Nature Genetics supplement 21, 20-24 (1999).

[2] Young, R.A.  Biomedical discovery with DNA arrays.  Cell 102, 9-15 (2000).

[3] Celis, J.E., et al.  Gene expression profiling: monitoring transcription and translation products using DNA microarrays and proteomics.  FEBS Letters 480, 2-16 (2000).

[4] Brown, M.P.S., et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines.  PNAS 97, 262-267 (2000).

[5] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.  Cluster analysis and display of genome-wide expression patterns.  PNAS 95, 14863-14868 (1998).

[6] Toronen, P., Kolehmainen, M., Wong, G., Castren, E.  Analysis of gene expression data using self-organizing maps.  FEBS Letters 451, 142-146 (1999).

[7] Tamayo, P., et al.  Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation.  PNAS 96, 2907-2912 (1999).

[8] Primig, M., et al.  The core meiotic transcriptome in budding yeast.  Nature Genetics 26, 415-423 (2000).

[9] Gasch, A.P., et al.  Genomic expression programs in the response of yeast cells to environmental changes.  Mol. Biol. Cell 11, 4241-4257 (2000).

[10] Yano, N., et al.  Comprehensive gene expression profile of the adult human renal cortex: Analysis by cDNA array hybridization.  Kidney International 57, 1452-1459 (2000).

[11] Drawid, A. and Gerstein, M.  A Bayesian system integrating expression data with sequence patterns for localizing proteins: Comprehensive application to the yeast genome.  J. Mol. Biol. 301, 1059-1075 (2000).

[12] Drawid, A., Jansen, R., Gerstein, M.  Genome-wide analysis relating expression level with protein subcellular localization.  Trends in Genetics 16, 426-430 (2000).

[13] Jansen, R. and Gerstein, M.  Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins.  Nucleic Acids Research 28, 1481-1488 (2000).

[14] Gerstein, M. and Jansen, R.  The current excitement in bioinformatics-analysis of whole-genome expression data: how does it relate to protein structure and function?  Current Opinion in Structural Biology 10, 574-584 (2000).