Yibin Xiao

MB&B 452a

 

Analysis and Applications of Expression Datasets

 

Many types of molecular biology datasets are analyzed in bioinformatics, such as raw DNA sequences, protein sequences, and macromolecular structures. Whole-genome expression data now provide a rich new source of information: they give the expression level of almost every individual gene in the context of all the genes in the genome. Expression levels can be measured under different environmental conditions, at different stages of the cell cycle, and in different cell types and states (healthy or diseased).

 

There are two kinds of expression experiments: those that monitor the amount of mRNA produced by the cell and those that monitor the amount of protein. For the former, three main technologies are employed: cDNA microarrays, Affymetrix GeneChips, and SAGE (serial analysis of gene expression). The first method measures relative mRNA levels (yielding an 'expression ratio'), while the other two measure absolute levels. These experiments can be applied to a wide variety of organisms. At present, the largest yeast dataset contains approximately 20 time-point measurements for 6,000 genes. For human, several projects have provided data for tumor and cancer cells. Currently, protein abundance is measured by 2D gel electrophoresis followed by mass spectrometry. Because gels can resolve only about 1,000 proteins, only the most abundant ones can be detected. Expression data generated by proteomic technologies are therefore still limited.

 

Analysis of expression data includes internal clustering and cross-referencing to external information concerning protein function, structure, localization, regulation, and so on. This can be done either by supervised learning or by relating patterns found in unsupervised learning to external data. The starting point is to define a mathematical similarity metric between each pair of expression profiles and then to construct a distance matrix, which forms the basis for many clustering algorithms. The clustering result can be displayed visually by representing each data point with a color that reflects the measurement both quantitatively and qualitatively. In unsupervised learning, the most common methods are hierarchical clustering, k-means, and the self-organizing map (SOM). The first approach, derived from phylogenetic tree construction, groups profiles in a 'bottom-up' fashion, joining the most similar ones first. The other two are 'top-down': the number of clusters is predefined and the assignment is iterated until convergence. In supervised learning, a training set of observations with known classifications is required to establish an appropriate model or decision boundaries, and the resulting classifier can then be applied to a set of observations with unknown classifications. For instance, a Bayesian system for predicting protein localization is one kind of supervised machine learning method.
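
As a minimal sketch of the unsupervised route described above, the following Python code builds a distance matrix from the Pearson correlation between expression profiles and then clusters the profiles bottom-up. The choice of 1 - r as the distance, average linkage as the merging rule, and the toy four-gene dataset are all assumptions made for illustration, not details taken from the studies cited here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile has zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def distance_matrix(profiles):
    """Pairwise distance d = 1 - r (an assumed, commonly used choice)."""
    n = len(profiles)
    return [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


def hierarchical_clusters(profiles, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    Returns a sorted list of clusters, each a sorted list of profile indices.
    """
    dist = distance_matrix(profiles)
    clusters = [[i] for i in range(len(profiles))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy data (hypothetical): two rising and two falling time courses.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.2, 7.9],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 3.9, 2.0],
]
print(hierarchical_clusters(toy_profiles, 2))  # [[0, 1], [2, 3]]
```

On this toy input the two rising profiles and the two falling profiles end up in separate clusters. The sketch recomputes linkages naively at every step, so for genome-scale data a library implementation would be the more practical choice.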

 

Relating expression data to other external information can provide insight into the characteristics of proteins that are expressed together under certain conditions, and can also suggest conclusions about the overall biochemistry of the cell. It can be a powerful means for functional analysis, drug development, disease diagnosis, and the study of complex living processes.

 

The function of a gene is usually inferred from the sequence homology of its product to known proteins, found by a database search with a tool such as FASTA or BLAST. Sequence analysis alone, however, is insufficient to fully detect functional linkages between gene products. Expression experiments on yeast show that clusters of genes with similar expression profiles tend to share related functions in cellular processes. Such genes are likely to be coregulated and involved in a common pathway. Thus, once expression profiles are assigned to functional groups, function prediction becomes possible. A recent study by Stephen Friend and colleagues developed a reference database of yeast gene expression profiles for 300 different mutations and chemical treatments. The cellular pathway affected by a mutation in an uncharacterized gene can be ascertained by matching the expression profile caused by the uncharacterized perturbation to profiles in the reference database. This is called the compendium approach. In many other cases, the relationship between clusters of expression profiles and functional categories is not so simple. A P value calculated from the distribution of correlation coefficients can be used to evaluate the statistical significance of a given clustering of genes. Some functional groups are always highly correlated with expression profiles, while others are correlated only in certain experiments. The discrepancies also reflect the difficulty of defining functions consistently across a wide variety of organisms. By focusing on attributes of proteins other than function, such as structure, localization, and regulation, a clearer relation to gene expression can sometimes be found, which sidesteps the problem.
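
One way to attach a P value to a group of genes is a permutation test: compare the mean pairwise correlation inside the group with the same statistic for randomly drawn gene sets of equal size. The sketch below assumes this particular test. The function names, the number of trials, and the add-one correction are illustrative choices and not necessarily the procedure used in the cited work.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_r(profiles):
    """Average correlation over all pairs of profiles (needs at least two)."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def group_p_value(all_profiles, group_idx, trials=1000, seed=0):
    """Empirical P value for the coherence of one gene group.

    Counts how often a random gene set of the same size is at least as
    tightly correlated as the group itself.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_pairwise_r([all_profiles[i] for i in group_idx])
    hits = 0
    for _ in range(trials):
        sample = rng.sample(range(len(all_profiles)), len(group_idx))
        if mean_pairwise_r([all_profiles[i] for i in sample]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate above zero.
    return (hits + 1) / (trials + 1)
```

A small value suggests that the group's expression profiles are more coherent than expected by chance. A large value means the functional category is not reflected in this particular set of experiments.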

 

The compendium approach described above is effective at identifying the functional consequences of chemical treatments as well as at characterizing genetic perturbations. Indeed, the expression profiles obtained from drug-treated cells are similar to the profiles obtained when the genes encoding the drug targets are mutated. This trait can be exploited in pharmaceutical research: chemical reagents should be powerful probes for examining the functions of specific proteins and pathways in mammalian cells. Drug discovery typically begins with a biochemical pathway implicated in a pathophysiological process, and bringing each new drug to market costs $50-500 million. Microarray expression analysis is already being integrated into many steps of drug development to make the process more precise and efficient. These steps may include target identification, target validation, optimizing efficacy and reducing toxicity, and identifying the clinical trial participants who respond best or react adversely to specific therapeutics.
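
The matching step of the compendium approach can be sketched as a ranked lookup: score the profile of a drug-treated sample against every reference profile and report the closest ones. The sketch assumes Pearson correlation as the similarity score, and the mutant names and values are hypothetical placeholders rather than data from the compendium study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_matches(query, compendium):
    """Return (name, correlation) pairs, most similar reference first."""
    scored = [(name, pearson(query, profile))
              for name, profile in compendium.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical reference profiles (e.g. log expression ratios over 4 genes).
reference = {
    "mutant_A": [2.0, -1.0, 0.5, 0.0],
    "mutant_B": [-1.0, 2.0, 0.0, 1.0],
    "mutant_C": [0.1, 0.2, -2.0, 1.5],
}
# Hypothetical profile from cells treated with an uncharacterized compound.
drug_profile = [1.8, -0.9, 0.6, 0.1]

for name, r in rank_matches(drug_profile, reference):
    print(f"{name}: r = {r:.2f}")
```

Here the treated sample correlates most strongly with `mutant_A`, which under the reasoning above would point to the gene mutated in that strain, or its pathway, as a candidate target of the compound.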

 

Expression profiles can be considered a molecular phenotype of the cell in a certain state, provided that the cellular transcriptional response to disruption of different steps in the same pathway is similar and that the responses to perturbation of most cellular pathways are sufficiently unique. Using expression patterns to distinguish between cell types and disease states will improve disease diagnosis. The information obtained by classifying diseases should also contain valuable clues to their mechanisms and inspire new tactics to combat them. Current cancer classification techniques rely on highly subjective judgements of tumor histology by pathologists. The ability to distinguish morphologically similar human cancers by the differences between their expression profiles is important because accurate identification of tumor types will facilitate matching the appropriate therapies, thereby maximizing therapeutic efficacy and minimizing toxicity. Infectious diseases are another major challenge to human health. The human population is continuously threatened by newly emerging and increasingly drug-resistant pathogens, many of which are difficult to culture and thus to identify. Expression profiles of human macrophages infected by various viral and bacterial pathogens suggest that it is feasible to obtain signatures for specific pathogens by capturing their effects on host cell gene expression.
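
To make the idea of classifying samples by expression profile concrete, here is a minimal nearest-centroid classifier. It is a deliberately simple stand-in, not the method of the cited cancer studies; the class labels, the three-gene profiles, and the use of Euclidean distance are assumptions for illustration.

```python
from math import sqrt


def centroid(samples):
    """Gene-by-gene mean of a list of expression profiles."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def train(training_set):
    """Build the model: one centroid per known class label."""
    return {label: centroid(samples)
            for label, samples in training_set.items()}


def classify(model, profile):
    """Assign the label whose centroid lies closest to the profile."""
    return min(model, key=lambda label: euclidean(model[label], profile))


# Hypothetical training set: samples with known classifications.
training_set = {
    "type_1": [[5.0, 1.0, 0.5], [4.5, 1.2, 0.4]],
    "type_2": [[1.0, 4.8, 3.9], [0.8, 5.1, 4.2]],
}
model = train(training_set)
print(classify(model, [4.8, 0.9, 0.6]))  # type_1
print(classify(model, [1.1, 5.0, 4.0]))  # type_2
```

The two steps mirror the supervised setting: the training samples with known labels define the model, and the model is then applied to samples whose class is unknown.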

 

Expression data will help to decipher the rules governing complex living processes and to classify those processes, since they provide the basis for understanding how genes work together to guide the functions of cells and organisms. Hartwell, Hopfield, et al. have argued for recognizing functional modules as fundamental elements of biological organization and regulation, since most biological functions arise from interactions among many components. Expression data analysis can help to detect functional linkages between gene products and thereby trace out networks of interacting proteins. Among the complex systems in biology are the regulatory networks that govern gene expression. Just as knowledge of biochemical pathways has been fundamental to basic biology and to drug discovery, knowledge of regulatory networks will facilitate the modeling of biological processes and efforts to develop therapies for disorders and diseases. Biologists and computer scientists are now exploring how to use genome-wide expression profiles to deduce models of these regulatory networks.

 

Analysis and interpretation of expression data have had a profound impact on research in biology, pharmacology, and medicine. However, major conceptual and technical challenges lie ahead in this rapidly evolving arena. The experimental standards, expression database management systems, and computational tools that facilitate comparisons of large datasets are still under development.

References:

1. Gerstein M, Jansen R (2000), The current excitement in bioinformatics, analysis of whole-genome expression data: How does it relate to protein structure and function? Current Opinion in Structural Biology 10: 574-584

2. Lipschutz R.J, Fodor S.P.A, Gingeras T.R, Lockhart D.J (1999), High density synthetic oligonucleotide arrays. Nature Genetics 21: 20-24

3. Young R.A (2000), Biomedical discovery with DNA arrays. Cell 102: 9-15

4. Eisen M.B, Spellman P.T, Brown P.O, & Botstein D (1998), Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA 95: 14863-8

5. Hughes T.R, Marton M.J, Jones A.R, Roberts C.J, et al (2000), Functional discovery via a compendium of expression profiles. Cell 102:109-126

6. Jansen R, Gerstein M (2000), Analysis of the yeast transcriptome with structural and functional categories. Nucleic Acids Res. 28: 1481-1488

7. Gerstein M (2000), Integrative database analysis in structural genomics. Nature Structural Biology 7: 960-963

8. Debouck C, Metcalf B (2000), The impact of genomics on drug discovery. Annu. Rev. Pharmacol. Toxicol. 40:193-207

9. Alon U, Barkai N, Notterman D.A, Gish K, et al (1999), Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96: 6745-6750

10. Golub T.R, Slonim D.K, Tamayo P, Huard C, et al (1999), Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537

11. Hartwell L.H, Hopfield J.J, Leibler S, Murray A.W (1999), From molecular to modular cell biology. Nature 402: C47-C52

12. Eisenberg D, Marcotte E.M, Xenarios I, Yeates T.O (2000), Protein function in the post-genomic era. Nature 405: 823-826