Anthony Lewis

Bioinformatics, MB&B 452a

Professor Mark Gerstein

December 9, 1999

The Visual Display of Biological Information:

Application of Edward R. Tufte’s Principles to Biological Data

Over the last decade, managing biological information has become a challenge because of the sheer increase in the amount of data obtained. As high-throughput technologies generate ever more information, the design of relational databases, and of the displays for the information queried from them, becomes critical to accessing the appropriate information selectively and efficiently.

High-throughput techniques such as microarrays and contig assembly now yield data on genes, and on whole genomes, at rates orders of magnitude faster than in the past. A critical issue that arises out of this sea of information is how to access it in an efficient and logical manner. Browser-like interfaces to relational databases attempt to display information intuitively while still presenting as much of it as possible. Some of the principles used to guide such efforts will be examined, as will a proposal for how to manage and display large quantities of biological data.

Coping with large bodies of information is not an entirely new endeavor for those who display information through browsers, whether over the Internet or within a local application. CGI scripts, Active Server Pages, and JavaScript allow the data to be kept in a database so that only what is needed is displayed when it is queried. While this is an elegant way of managing information, it leaves little room for more intricate displays that require capabilities beyond those of HTML-based documents, such as generating graphics on the fly.
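The idea of storing everything in a database but displaying only what a query asks for can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module and a hypothetical table named genes with made-up rows; none of these names come from any system discussed here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'genes' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes "
        "(name TEXT, chromosome TEXT, start_bp INTEGER, end_bp INTEGER)"
    )
    # Illustrative rows only; these are not real genes or coordinates.
    rows = [
        ("geneA", "1", 1000, 5000),
        ("geneB", "1", 20000, 26000),
        ("geneC", "2", 300, 900),
    ]
    conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", rows)
    return conn


def genes_in_region(conn, chromosome, start_bp, end_bp):
    """Return only the genes that overlap the requested region."""
    cur = conn.execute(
        "SELECT name, start_bp, end_bp FROM genes "
        "WHERE chromosome = ? AND end_bp >= ? AND start_bp <= ? "
        "ORDER BY start_bp",
        (chromosome, start_bp, end_bp),
    )
    return cur.fetchall()


def render_rows(rows):
    """Format query results as plain-text lines for display."""
    return [f"{name}: {start}-{end}" for name, start, end in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    for line in render_rows(genes_in_region(conn, "1", 0, 10000)):
        print(line)
```

The whole table stays in the database; the page shows only the rows that the viewer's query selects.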

Edward R. Tufte has written extensively on the visual display of information. He lays down several principles that, while seemingly common sense, will prove critical in managing larger and larger quantities of data. One of his axioms is economy of ink. To make every graphical element express as much as possible, he asserts that the ink carrying data should constitute as large a share as possible of the total ink used; a corollary is that the ink used merely to frame the data, i.e. superfluous illustrations and framing elements, should be minimized.[1] This is an effort to reduce distractions from the content of the data. In the same spirit, the density of data represented per unit of area should be maximized.[2] Together, these two principles ensure that the data represented is as rich and clear as possible.
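These two measures can be written as simple ratios. The sketch below is only an illustration: the function names are hypothetical, and it assumes that "ink" can be approximated by some countable quantity, such as drawn pixels, which is an assumption made here for the sake of the example.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of the total ink that carries data (between 0 and 1)."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must lie between 0 and total_ink")
    return data_ink / total_ink


def data_density(n_values, area):
    """Number of data values shown per unit of display area."""
    if area <= 0:
        raise ValueError("area must be positive")
    return n_values / area


if __name__ == "__main__":
    # Hypothetical graphic: 800 of 1000 drawn pixels encode data,
    # and 200 values are shown in 50 square centimetres.
    print(data_ink_ratio(800, 1000))
    print(data_density(200, 50.0))
```

Under these assumptions, removing a decorative frame lowers total_ink without touching data_ink, so the ratio rises; shrinking the graphic while keeping the same values raises the density.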

Beyond such generalizations about the way to present the data, there are specific strategies needed to organize the information as it is displayed to the viewer. Multiple layers of information are created by multiple viewing depths and multiple viewing angles.[3] These principles can be, and as will be seen have been, applied to biological data representation. Multiple depths may be implemented as seeing the chromosome map of alleles, then zooming in to see the introns and exons, and finally the sequence, possibly enriched with the loci of single nucleotide polymorphisms. Multiple angles may be implemented as seeing information that relates in different dimensions: a gene may be seen in the context of the surrounding sequences and alleles, or in the context of the metabolic or signaling pathways in which it lies and how it is regulated, or in the context of the genes homologous to it in different organisms.
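The notion of multiple viewing depths can be sketched as a rule that decides which kinds of features are drawn at a given zoom level. The sketch below is a hypothetical illustration: the Feature class, the thresholds in MAX_BP_PER_PIXEL, and the example coordinates are all assumptions, not taken from any existing browser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    kind: str   # "gene", "exon", or "snp"
    start: int  # first base of the feature
    end: int    # last base of the feature


# Hypothetical thresholds: the coarsest resolution (bases per pixel)
# at which each kind of feature is still drawn.
MAX_BP_PER_PIXEL = {"gene": 100_000, "exon": 1_000, "snp": 10}


def visible_features(features, view_start, view_end, width_px):
    """Return the features that overlap the view and suit its resolution."""
    bp_per_pixel = (view_end - view_start) / width_px
    return [
        f for f in features
        if f.end >= view_start
        and f.start <= view_end
        and bp_per_pixel <= MAX_BP_PER_PIXEL[f.kind]
    ]


if __name__ == "__main__":
    features = [
        Feature("geneA", "gene", 1000, 9000),
        Feature("geneA exon 1", "exon", 1000, 1500),
        Feature("snp1", "snp", 1200, 1200),
    ]
    # Zoomed out: 10,000 bases per pixel, so only the gene is drawn.
    print([f.name for f in visible_features(features, 0, 10_000_000, 1000)])
    # Zoomed in: 10 bases per pixel, so exons and SNPs appear as well.
    print([f.name for f in visible_features(features, 0, 10_000, 1000)])
```

Multiple viewing angles would need a different structure, for instance separate lookups from a gene to its pathways or homologs; that is left out of this sketch.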

One proposal, Zomit, is a Java™ applet that utilizes some of these principles (http://www.infobiogen.fr/services/zomit).[4] It is a tool for browsing the map of the human genome. The first view presents the pairs of chromosomes in a large window on the left of the browser, and a hierarchy of location labels in a smaller window on the right. Clicking and dragging horizontally on a chromosome enlarges the chromosomes and reveals labels on the specific alleles. The right window displays information relevant to the position of the mouse pointer in the map on the left. Further dragging enlarges the chromosomes more, and converts the right window into a more detailed map, including genetic markers, of the region under the mouse pointer.

Thus the data is seen from many perspectives. First, it can be manipulated to display different depths of information, that is, the resolution at which one wishes to view the chromosome. Second, the data-ink is maximized. The maps of the chromosomes are all that is displayed on the left, and the maps themselves are laden with information: the size of genes is displayed, as well as the distance between them. As one zooms in, the markers are labeled with different colors to indicate what they might be used for or what sort of marker they are. The labels for the markers also count as data-ink. The right window has some non-data-ink, namely the lines displaying the relationships within the hierarchy, but they are minimal and give order to the display. One might consider them non-expendable non-data-ink. The density of data across the area used may not be maximized in the first view presented, but as one zooms in, the screen fills with maps adjacent to each other.

One issue raised by Zomit’s technique is the use of color. Color is most effectively used to distinguish within a dimension that does not contain a very stratified range of information. That is, it is best used to distinguish between only a few potential options. In Zomit’s application it suffices, because only a few colors are used. However, Tufte comments that color has no natural hierarchy, and so a dimension encoded in color may be difficult to understand visually without verbally coaxing oneself through it.[5] This is especially true if there are many other dimensions of the data to consider. By contrast, the order inherent in shades of gray is much easier to interpret immediately.

Zomit should be applauded for incorporating so much information in such a simple design schema. One of the pitfalls of ordering large amounts of multivariate information is that the complexity of the design required to accommodate the information can leave the viewer puzzled. Tufte warns that the “complexity of multi-functioning elements can sometimes turn data graphics into visual puzzles, crypto-graphical mysteries for the viewer to decode.”[6] A simple design is critical for immediate perception of the data and the relationships within it.

Tufte’s principles, developed through his work in statistics, are quite germane to the task of organizing biological data. Graphics have the potential to carry, and to interrelate, much more information than an equal area of text or table. With innovative programming, the basic module of an HTML-based page may be succeeded by a new generation of information-rich visual interaction with databases.

[4] Stuart Pook, Guy Vaysseix, and Emmanuel Barillot, Zomit: biological data visualization and browsing, Bioinformatics, 14 (9), p. 807-814.
