BIOINFORMATICS: How to Get Databases Talking the Same Language

Nigel Williams

CAMBRIDGE, U.K.--Biologists have been full participants in the Web mania that has swept the Internet, transforming it from a talking shop for researchers into a medium of mass communication that encompasses science, business, and leisure. Indeed, the easy access to data provided by the World Wide Web (WWW) has fed the exponential growth of biological databases housing information on everything from genome sequences to natural history collections. But this new world is increasingly balkanized. Differences in the structure of databases and in their nomenclature mean that a researcher working on a gene or protein in one species may find it exceedingly difficult to find data on the same gene in databases for other species. "There's a significant lack of interoperability. There are now hundreds of databases. There's stuff coming out of the woodwork, but we don't know what to do with it," says Graham Cameron, head of services at the European Bioinformatics Institute (EBI) near Cambridge.

Cameron and others hope to turn this fractured landscape into something more coherent. These bioinformatics experts are trying to develop common standards and names for databases and to establish links between them so that when researchers seek information from one database, they are automatically linked to other databases with additional data. "Anyone interested in the yeast gene TUP1 faces seven different names for the same gene in other species," says biochemist Amos Bairoch of the University of Geneva.

Efforts to standardize nomenclature have met with resistance from some database curators and specialists in some research fields, who hesitate to change preferred terminology. But researchers are optimistic that Web tools developed for other uses--such as the portable programming language Java--may come to the rescue by creating bridges between databases even if they differ radically in structure and nomenclature. All these efforts, however, are jeopardized by the uncertain funding for many smaller databases, says biological software designer Theresa Attwood of University College London.

The problem of nomenclature stems from the diversity of biological research communities studying gene and protein functions and the lack of common vocabulary among them. "Biochemistry is not necessarily a one-to-one mapping with genetics, but there is a clear relationship. We need to formalize a description of biological functions," says EBI researcher Chris Sander. Thus far, building links between databases has largely depended on the efforts of knowledgeable curators who check to ensure links are correct, meaningful, and up to date.

The Web, by making it easy to create links between disparate databases, has opened the way to an extension of this approach: "federations" of small databases. These avoid the need to set up common database structures by simply defining active hypertext links, via the WWW, to link relevant data between the databases. Independent database curators agree on these links. Several federations are under development. Bairoch, who developed the SWISSPROT database of protein sequences and is now part of a team building a federated two-dimensional protein electrophoresis database, SWISS-2DPAGE, says the approach makes it "more and more easy to create crosslinks."
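The federation idea can be pictured with a small sketch. It is illustrative only: the database labels, accession numbers, and URL templates below are hypothetical placeholders, not the real link formats of SWISSPROT or SWISS-2DPAGE. Each entry simply carries curator-agreed cross-references, and the links are generated from them without any shared internal structure.

```python
from dataclasses import dataclass, field

# Hypothetical URL templates (assumption): each federated database
# would publish its own link format for curators to agree on.
URL_TEMPLATES = {
    "SEQ_DB": "http://seqdb.example.org/entry/{acc}",
    "GEL_DB": "http://geldb.example.org/spot/{acc}",
}


@dataclass
class Entry:
    database: str
    accession: str
    # Curator-agreed cross-references as (database, accession) pairs.
    cross_refs: list = field(default_factory=list)


def link_for(database, accession):
    """Build the hypertext link for one entry in one federated database."""
    return URL_TEMPLATES[database].format(acc=accession)


def render_links(entry):
    """Return the outgoing links for every cross-reference on an entry."""
    return [link_for(db, acc) for db, acc in entry.cross_refs]


# A made-up gel-database entry pointing at a made-up sequence entry.
entry = Entry("GEL_DB", "G0001", cross_refs=[("SEQ_DB", "S0001")])
print(render_links(entry))
```

In this picture the only thing the two databases share is the agreed link format, which is why the approach avoids the need for a common database structure.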


Figure 1. Going up. Rapid growth of EMBL's public nucleotide sequence database. (Credit: EBI)


But to cope with the deluge of new data and databases, such links will have to be created automatically by the database software. And the obvious first step--agreeing on nomenclature and compatible formats so that automatic links can be built up--has proved exceedingly difficult to achieve, with researchers haggling in particular over nomenclature. "Different names for the same gene are sometimes strongly held--it's really a mess. Efforts to assign a neutral name often lead to angry letters and disputes," says Bairoch. "Last June, SWISSPROT was fully integrated with the EMBL [European Molecular Biology Laboratory] nucleotide database. It has taken years to get things working," he says.

Similarly, the genome sequences of the bacteria Bacillus subtilis and Escherichia coli are both due to be completed this year. The two species have many genes in common--but under different names, creating a terminological impasse for researchers trying to match genes across them. The impasse was resolved "in a semidictatorial way," says Bairoch. "The B. subtilis names will switch to E. coli nomenclature with the old names kept as synonyms."
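Keeping old names as synonyms amounts to a lookup table in which every known name resolves to one official name. The sketch below is a minimal illustration under that assumption; the class, its methods, and the gene names are all invented for the example and do not describe how any actual database stores its nomenclature.

```python
class GeneNameRegistry:
    """Maps every known gene name to one official (primary) name."""

    def __init__(self):
        self._to_primary = {}  # any known name -> primary name
        self._synonyms = {}    # primary name -> set of synonyms

    def register(self, primary, synonyms=()):
        """Record a primary name and any synonyms that point to it."""
        self._to_primary[primary] = primary
        self._synonyms.setdefault(primary, set())
        for name in synonyms:
            self._to_primary[name] = primary
            self._synonyms[primary].add(name)

    def rename(self, old_primary, new_primary):
        """Adopt new_primary as official; keep all old names as synonyms."""
        old_names = self._synonyms.pop(old_primary, set()) | {old_primary}
        self.register(new_primary, old_names)

    def resolve(self, name):
        """Return the official name for any known name, or None."""
        return self._to_primary.get(name)

    def synonyms(self, primary):
        return sorted(self._synonyms.get(primary, ()))


# Made-up placeholder names, not real B. subtilis or E. coli genes.
registry = GeneNameRegistry()
registry.register("geneB1")
registry.rename("geneB1", "geneE1")
print(registry.resolve("geneB1"), registry.synonyms("geneE1"))
```

A search on the old name still succeeds after the switch, which is the practical point of retaining synonyms.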

Despite the benefits of standardization, "the lesson has been that it can be incredibly difficult to gain consensus and to agree who has the right to change schemes. Standards are too slow and expensive," says informatics researcher Tomás Flores at the EBI. He and others are pinning their hopes on a different approach: standardizing only the messages between databases rather than their internal detail. Such an approach, referred to by programmers as "object-oriented," would involve taking chunks of data and "wrapping" them in a format that is standard across all databases--hence, the only agreement needed is on the interfaces. "The goal is for a process that's invisible. Researchers interested in a particular gene or protein would get all the information available without having to know where it was stored," says EBI's Cameron.
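A rough sketch of the wrapping idea follows. It is plain Python rather than anything specific to a real database or middleware product, and every class name, field name, and record in it is a hypothetical stand-in. Two sources store their data differently, yet both answer the same `fetch` call with records in one agreed shape, so the caller never needs to know where the data live.

```python
from abc import ABC, abstractmethod


class GeneSource(ABC):
    """The only thing the databases agree on: this interface."""

    @abstractmethod
    def fetch(self, gene):
        """Return a list of dicts with keys: source, gene, description."""


class FlatFileSource(GeneSource):
    """Wraps a source that stores 'gene|description' text lines."""

    def __init__(self, name, lines):
        self.name = name
        self._lines = lines

    def fetch(self, gene):
        records = []
        for line in self._lines:
            key, _, text = line.partition("|")
            if key == gene:
                records.append(
                    {"source": self.name, "gene": key, "description": text}
                )
        return records


class TableSource(GeneSource):
    """Wraps a source that stores rows with 'locus' and 'note' fields."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows

    def fetch(self, gene):
        return [
            {"source": self.name, "gene": row["locus"], "description": row["note"]}
            for row in self._rows
            if row["locus"] == gene
        ]


def query_all(sources, gene):
    """Collect records for a gene from every source, wherever it is stored."""
    results = []
    for source in sources:
        results.extend(source.fetch(gene))
    return results


# Made-up data for illustration only.
sources = [
    FlatFileSource("flat_db", ["geneX|made-up description A"]),
    TableSource(
        "table_db",
        [
            {"locus": "geneX", "note": "made-up description B"},
            {"locus": "geneY", "note": "unrelated entry"},
        ],
    ),
]
hits = query_all(sources, "geneX")
for hit in hits:
    print(hit["source"], hit["description"])
```

Adding a third database in this scheme means writing one more wrapper class; neither the existing sources nor `query_all` has to change.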

A group of European bioinformatics centers last year won funding from the European Union to study an interface being developed by the world's largest software consortium, the Object Management Group, which comprises most major software and information technology companies. The approach, dubbed the Common Object Request Broker Architecture (CORBA), has already been embraced by companies from aircraft manufacturers to banks. "The idea behind CORBA is that biologists will never entirely agree on common formats for data entry in databases," says Flores at the EBI, one of the centers involved. So, rather than imposing external rules, the CORBA approach tries to separate data access from data management. The project will cover several of the larger nucleotide and protein sequence databases, including EMBL and SWISSPROT, and several new and emerging ones.

The CORBA approach has also been boosted by researchers' growing interest in Java, a computer language that allows software to be sent through Internet links to carry out small applications, called "applets," remotely (see Science, 2 August 1996, p. 591). Using Java, researchers not only can access a database, but can interrogate it in an intelligent way through applets, regardless of database format. Java will use CORBA standards as a tool allowing researchers to range freely among databases. "The success of Java makes a CORBA-based approach much more likely to become mainstream," says Flores.

But funding shortages may interfere with these ambitious plans. Some funding agencies do not regard the new specialized databases as a core, knowledge-generating activity, so grant decisions can be capricious. Even SWISSPROT, one of the oldest protein sequence databases, was endangered last year when Swiss funding agencies threatened to withdraw funding for expansion of what they saw as an increasingly international resource. "Many funding bodies will support the creation of a new database but are less willing to fund development and continuing support for them. It's crazy, and I hope the tide of things will change," says University College London's Attwood.

But in spite of the hurdles, bioinformatics researchers are mostly confident that they will be able to lower the barriers between databases. "Over 10 years, I'm optimistic because the need for bioinformatics is so clear. An increasing fraction of biology is devoted to information handling--it's intrinsically an information science," says EBI's Sander. "The needs eventually will be met."


Volume 275, Number 5298 Issue of 17 January 1997, pp. 301 - 302
©1997 by The American Association for the Advancement of Science.


