“The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by the World Wide Web Consortium (W3C) with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework.” – W3C Semantic Web Activity
The Semantic Web
Overview
Chemical Semantics, Inc. is a new start-up devoted to bringing the semantic web to chemistry and biochemistry. The semantic web is referred to as Web 3.0 or alternatively the Web of Data or the Web of Meaning. It does not replace the existing World Wide Web but augments it, placing data on the web in a structured form such that the data has “meaning” and computers can understand it.
Data on the existing World Wide Web is encoded in documents containing text, tables, images, etc. This data is not really recognized by computers but has to be interpreted by humans. The semantic web puts data onto the web in a form that allows computers to properly recognize it along with its meaning. Computers can then perform intelligent operations on the data, ultimately creating new data by using inference from existing data. Because computers cannot recognize the data in existing web documents, the data cannot properly be shared or even found. The future of scientific data has to involve the semantic web or equivalent technologies.
Chemistry generates enormous amounts of data. Because existing publication channels do not give credence to the data in the way they give credence to the text of a “scientific publication”, much of this data is lost or discarded and not made available to other scientists. There is a trend towards journals requiring authors to submit data files along with the text of a publication. However, since there are as yet few standards and little infrastructure adopted by mainstream publishers for dealing with this data, most of it remains in abandoned files. The appropriate answer to this dilemma is the semantic web.
In conjunction with the data, the semantic web includes a vocabulary for describing the data. This is where semantics comes to the fore. This vocabulary for a field such as computational chemistry is encoded in a formal language. The Web Ontology Language (OWL) is such a language, and the “vocabulary” is really an ontology describing the formal language of a specific domain such as computational chemistry. Chemical Semantics has created such an ontology, which is referred to as the Gainesville Core (http://purl.org/chem/gc). This first ontology for computational chemistry will need to be modified and extended by scientists in the field as basic ontology ideas become more prevalent in chemistry.
The semantic web allows scientists to publish their data in a structured way such that it can be found and used by anyone with access to the World Wide Web. Chemical Semantics, Inc. is creating client and server software that allows scientists to automate their publishing of data into a modern graph database, Giant Global Graph (GGG), so that they and others can share their data and use it in a way that is not now possible with the existing World Wide Web (WWW). Adding semantics to scientific data and making that data available on the semantic web makes it possible to do science in a new collaborative way that will change science forever.
Publishing
To make the technology of the semantic web available to scientists, one has to “publish data” in a way that is related to the standard model of journal publishing. That is, one puts the data (not journal text) into an appropriate form, sends it to a publisher (of data not journal text), waits for its publication, and then informs colleagues that they can access the data (not journal article) in a standard fashion (more and more via the web rather than hard copy).
The appropriate form for data described above has been clearly defined by the World Wide Web Consortium (W3C) as the Resource Description Framework (RDF) standard. This standard defines a graph data model in which every RDF statement takes the “triple” form (subject, predicate, object). This form surpasses the normal relational database in its applicability to the web. Any and all scientific data can be put into this form, and data on the semantic web is stored in servers referred to as triple stores. These triple stores may contain many billions of triples.
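To make the triple form concrete, here is a minimal sketch in plain Python, assuming a toy in-memory store rather than a real triple store. The `http://example.org/chem#` namespace, the predicate names, and the energy value are all hypothetical placeholders; they are not terms from the Gainesville Core or results of any calculation.

```python
# Hypothetical namespace: illustrative identifiers, not Gainesville Core terms.
EX = "http://example.org/chem#"

# Each RDF statement is one (subject, predicate, object) triple.
triples = {
    (EX + "calc1", EX + "onMolecule", EX + "water"),
    (EX + "calc1", EX + "method", "Hartree-Fock SCF"),
    (EX + "calc1", EX + "totalEnergy", -76.0),  # placeholder value, not a real result
    (EX + "calc2", EX + "onMolecule", EX + "water"),
    (EX + "calc2", EX + "method", "DFT"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Which calculations were run on water?
water_calcs = {
    s for (s, _, _) in match(triples, predicate=EX + "onMolecule", obj=EX + "water")
}
```

Every question asked of the store reduces to a pattern over the three positions, which is why one uniform structure can hold any kind of scientific data.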
Chemical Semantics, Inc. operates servers that store scientific data in these triple stores. Our client software allows scientists to automate the publishing of their data to these triple stores. Initially, we are focusing on data produced by computational chemistry packages although the basic ideas apply to any chemical data including experimental data. The data produced from these computational chemistry packages depends somewhat upon the specific package but for illustrative purposes let’s assume a “CompChem” package that produces results of ab initio wave function calculations, such as Self-Consistent-Field (Hartree-Fock SCF) calculations, post Hartree-Fock correlated calculations, etc. Examples are Gamess, Gaussian, NWChem, etc.
The publishing of the results of these calculations ought not to be more difficult than having a “Publish” button in the GUI or text input of the package. For initial demonstration, we have implemented such a button in HyperChem as shown below:
The specific use of HyperChem is not relevant and any CompChem package ought to have such a button! The founders of Chemical Semantics, Inc. have a historical tie to Hypercube, Inc. and as such have used HyperChem to first illustrate the basic ideas. What this button in “CompChem” does is create an XML file structure that includes the information about the current molecular system and any current calculation results resident in the package and sends the data to the Chemical Semantics, Inc. portal where it is published given the Authors, Title, Abstract, Login information and other details that are part of the global setup prior to hitting the “Publish” button. Our current XML file structure is called CSX and is related in historical terms to the Chemical Markup Language (CML) file structure. We use CSX because CML currently does not have certain properties that we consider not only desirable but mandatory such as the ability to deal with residues as independent fundamental units of biological molecules.
Our Chemical Semantics portal accepts data publication using a REST or SOAP protocol and then publishes the data on its servers. The data is available to anyone around the world with an account at the portal. The details of the portal are described in another section below.
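As a rough illustration of the publish step over REST, the sketch below builds (but does not send) an HTTP POST that carries an XML payload. The portal URL and the XML element names are assumptions made for this example; the actual portal address and the CSX schema are not specified here, and this payload is not CSX.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint; the real portal URL is not given in this document.
PORTAL_URL = "https://portal.example.org/publish"


def build_publish_request(title, authors, abstract):
    """Build, without sending, an HTTP POST carrying an XML publication payload."""
    # Hypothetical element names, chosen only for illustration (not the CSX schema).
    root = ET.Element("publication")
    ET.SubElement(root, "title").text = title
    for name in authors:
        ET.SubElement(root, "author").text = name
    ET.SubElement(root, "abstract").text = abstract
    body = ET.tostring(root, encoding="utf-8")  # bytes, with an XML declaration
    return urllib.request.Request(
        PORTAL_URL,
        data=body,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


request = build_publish_request("Water SCF", ["A. Author"], "Illustrative payload.")
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out because the endpoint here is fictional.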
Searching
Our portal allows users to access data at the portal based upon various rules, search criteria, etc. Semantic web data is usually searched for using the SPARQL Protocol and RDF Query Language (SPARQL). Note the recursive acronym. A SPARQL endpoint is maintained at the portal which allows queries of various kinds. SPARQL has features in common with the SQL query language and is relatively easy to use but requires some experience in forming queries. A natural language front end would be desirable and many groups are involved in developing such front ends. A query, for example, could ask how many Density Functional Theory (DFT) calculations have been done on a specific molecule and with what functionals and what final total energies. As opposed to querying relational database silos, the query could potentially survey the whole world’s set of such calculations and return with a table of these.
Because the data stored by Chemical Semantics includes a relatively unlimited number of triples, searching can be very exhaustive while still being very focused. The data is defined by the ontology so that a search does not return irrelevant results. Because the data is held by a graph database where a graph node (resource) uses an arrow to point to another resource, the arrow can simply point from a resource in one graph to a resource in a second graph, and unlike a relational database the data from each graph can be merged trivially. This “federation” allows searches to really use a Giant Global Graph (GGG). As part of the publish activity, data can be tagged as private, protected, or public. Any search returns public data. For protected data, the author can pass a key to searchers to enable them to access his/her protected data. An alternative would be to label the publication private so that only the original authors can access the data.
Data also comes with tags set by the authors that help define the search. For example, an author might add a tag, “used_32bit_gpu”, if he/she thought it worthwhile to distinguish calculations on a 32-bit only graphical processing unit from those on a normal CPU. Once data has been put into a semantic web form, elaborate searches can be performed and the retrieved summary data could also be added as new data if so desired. Examples of SPARQL queries are shown below after the query language is described.
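The DFT question above might look roughly like the following SPARQL query, here wrapped in a short Python sketch that encodes it as a SPARQL Protocol GET request (the protocol passes the query text in a `query` parameter). The endpoint URL and the `ex:` vocabulary are hypothetical stand-ins; the real portal address and ontology terms may differ.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and vocabulary, for illustration only.
ENDPOINT = "https://portal.example.org/sparql"

QUERY = """
PREFIX ex: <http://example.org/chem#>
SELECT ?calc ?functional ?energy
WHERE {
  ?calc a ex:DFTCalculation ;
        ex:onMolecule ex:water ;
        ex:functional ?functional ;
        ex:totalEnergy ?energy .
}
"""


def sparql_get_url(endpoint, query):
    """Encode a query as a SPARQL Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)

# Round trip: decoding the URL recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Each row of the result would pair one calculation with its functional and total energy, which is the table described above. Like SQL, the query names the columns wanted (`SELECT`) and the conditions they must satisfy (`WHERE`), but the conditions are triple patterns rather than table joins.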
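The claim that graphs merge trivially can be sketched in a few lines of plain Python, assuming triples are held as tuples in sets and ignoring the extra care that blank nodes need in real RDF. All identifiers below are hypothetical.

```python
# Hypothetical identifiers; imagine graph_a and graph_b live on different servers.
EX = "http://example.org/chem#"

graph_a = {
    (EX + "calc1", EX + "onMolecule", EX + "water"),
    (EX + "calc1", EX + "method", "DFT"),
}
graph_b = {
    # The same URI for water appears here, so the two graphs join on it.
    (EX + "water", EX + "formula", "H2O"),
}

# Merging the two graphs is a set union of their triples.
merged = graph_a | graph_b


def objects(store, subject, predicate):
    """Return all objects reachable from subject via predicate."""
    return {o for (s, p, o) in store if s == subject and p == predicate}


# Follow an arrow from graph_a into graph_b: calculation -> molecule -> formula.
formulas = {
    formula
    for molecule in objects(merged, EX + "calc1", EX + "onMolecule")
    for formula in objects(merged, molecule, EX + "formula")
}
```

No schema mapping is needed because both graphs use the same global identifier for the molecule; in a relational setting the equivalent step would require agreeing on table layouts and keys first.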
Fundamental Semantic Web Technologies
The semantic web uses a number of “new” technologies that are briefly described here. Most of these are new to chemists at this point, and the following serves as a short primer on the semantic web so that the next section, which describes the actual operation of publishing and querying using the facilities of Chemical Semantics, Inc., is better understood. There are many good reference books on the semantic web but they generally are written for computer scientists, not chemists.
This tutorial will focus on describing computational chemistry data as applied to the semantic web.