CSX – Chemical Semantics Markup

Overview

Chemical Semantics, Inc. uses an Extensible Markup Language (XML) format file to capture information about computational chemistry calculations. Specifically, a Chemical Semantics XML (CSX) file is used to transfer structured data and metadata about calculations to our web portal where it is converted to the Resource Description Framework (RDF) format appropriate to the semantic web.

While CSX was developed to allow publication of computational chemistry calculations to the semantic web, it is a useful standard in and of itself because it organizes all the important information about a calculation in a format that is readable by both humans and computers. We suggest that CSX could become a standard output format for the computational chemistry community and invite interested readers to contribute to its development.

This section of our web site describes the current CSX standard for describing data from computational chemistry calculations (and possibly related experimental data). The standard specifically includes information describing a publication (title, author, etc.) because our portal essentially accepts “data publications” and places them onto the semantic web. CSX is still under development, so please be aware that the description below is dated November 2014. A CSX file has a version number, and the version described here is Version 1.0.

CSX Components

A CSX file is shown below:

csx_overview

NameSpaces

The CSX file above includes the basic XML components, including a comment that describes where the CSX file was created. The root element of the XML file is cs:chemicalSemantics. The attributes at the root include the version of CSX, the local file name of the CSX file, and the relevant namespaces being used. The principal namespace is cs:, which signifies the XML elements defined by Chemical Semantics, Inc. Other namespaces like xsd: and xsi: are associated with the XML schema (blueprint) that describes CSX. The namespaces dc: and dcterms: are part of Dublin Core, a metadata standard that is used for some of the publication parameters of CSX like title and abstract. The namespace bse: is used to describe basis sets and is subject to modification as we collaborate with Pacific Northwest National Laboratory (PNNL) on semantic definitions of standard basis sets.
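As a rough illustration of how the namespace prefixes work in practice, the sketch below parses a tiny CSX-like fragment with Python's standard library. The cs: namespace URI, the attribute names and the fragment itself are assumptions made for this example; only the dcterms: URI is the standard Dublin Core one.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for lookups. The cs: URI is a placeholder
# (assumption); the dcterms: URI is the standard Dublin Core terms namespace.
NS = {
    "cs": "http://example.org/cs#",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical fragment: attribute names and values are illustrative only.
SAMPLE = """<!-- created by an example exporter -->
<cs:chemicalSemantics xmlns:cs="http://example.org/cs#"
                      xmlns:dcterms="http://purl.org/dc/terms/"
                      version="1.0" fileName="water.csx">
  <cs:molecularPublication>
    <dcterms:title>Water SCF</dcterms:title>
  </cs:molecularPublication>
</cs:chemicalSemantics>
"""

root = ET.fromstring(SAMPLE)

# ElementTree expands prefixes, so the root tag is stored as "{uri}localName".
uri, local_name = root.tag[1:].split("}", 1)
title = root.find("cs:molecularPublication/dcterms:title", NS)

print(local_name, root.get("version"), title.text)
```

The prefix itself carries no meaning; a reader matches elements by namespace URI, which is why the lookup passes the prefix map explicitly.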

CSX Sections

While other sections could be added in future versions of CSX, the current standard includes three sections:

molecularPublication
molecularSystem
molecularCalculation

The Chemical Semantics portal essentially accepts “Data Publications”, i.e. data from computational chemistry computations that is being published at the portal. Thus, the first section of a CSX file describes the publication itself, including copies of the input and output files used in the calculation (if available). The second section describes the molecular system (a set of molecules) that calculations were performed on. The final section describes the calculation (or calculations) that were performed on the molecular system.
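The three-section layout can be sketched as an empty skeleton plus a small completeness check. This is a minimal sketch, not the published schema: the namespace URI is a placeholder and the helper names `build_skeleton` and `missing_sections` are invented for this example.

```python
import xml.etree.ElementTree as ET

CS_NS = "http://example.org/cs#"  # placeholder URI (assumption)
SECTIONS = ("molecularPublication", "molecularSystem", "molecularCalculation")


def build_skeleton() -> ET.Element:
    """Create a root element holding the three (empty) CSX sections."""
    root = ET.Element(f"{{{CS_NS}}}chemicalSemantics", {"version": "1.0"})
    for name in SECTIONS:
        ET.SubElement(root, f"{{{CS_NS}}}{name}")
    return root


def missing_sections(root: ET.Element) -> list:
    """Return the names of expected sections that are absent from root."""
    present = {child.tag.split("}", 1)[-1] for child in root}
    return [name for name in SECTIONS if name not in present]


skeleton = build_skeleton()
print(missing_sections(skeleton))
```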

Molecular Publication

The molecularPublication section of a CSX file is shown below:

csx_publication

A publication has a title and an abstract, as defined in the Dublin Core specification. We use the Dublin Core namespace, dcterms:, to describe these. The “publisher” here is empty but would contain the name of the company or institution of the lead author.

Authors

A publication can have a number of authors, each described by a type. The sole author here is described as cs:corresponding, i.e. the Corresponding Author, who is usually the Principal Investigator (PI) or someone with authority over the publication. The attribute “type” can also have a value of cs:submitting, indicating the author submitting the publication, or can be empty, indicating another author or co-author. Each author has a name described by cs:creator, an organization described by cs:organization, and an email address described by cs:email.
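The author description above might be read back along the following lines. The wrapping element name cs:author, the namespace URI and the sample people are assumptions for illustration; cs:creator, cs:organization, cs:email and the type values come from the text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

NS = {"cs": "http://example.org/cs#"}  # placeholder URI (assumption)

# Hypothetical fragment; the cs:author element name is an assumption.
SAMPLE = """<cs:molecularPublication xmlns:cs="http://example.org/cs#">
  <cs:author type="cs:corresponding">
    <cs:creator>A. Author</cs:creator>
    <cs:organization>Example University</cs:organization>
    <cs:email>a.author@example.org</cs:email>
  </cs:author>
  <cs:author type="">
    <cs:creator>B. Coauthor</cs:creator>
    <cs:organization>Example University</cs:organization>
    <cs:email>b.coauthor@example.org</cs:email>
  </cs:author>
</cs:molecularPublication>
"""


@dataclass
class Author:
    role: str
    name: str
    organization: str
    email: str


def read_authors(publication: ET.Element) -> list:
    """Collect authors; an empty type attribute is treated as a co-author."""
    authors = []
    for node in publication.findall("cs:author", NS):
        # "cs:corresponding" -> "corresponding"; "" -> "co-author".
        role = node.get("type", "").split(":")[-1] or "co-author"
        authors.append(
            Author(
                role=role,
                name=node.findtext("cs:creator", default="", namespaces=NS),
                organization=node.findtext("cs:organization", default="", namespaces=NS),
                email=node.findtext("cs:email", default="", namespaces=NS),
            )
        )
    return authors


authors = read_authors(ET.fromstring(SAMPLE))
for author in authors:
    print(author.role, author.name)
```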

Source

The data being published should have an indication of its source, i.e. the software package that created the data, such as GAMESS, NWChem, PSI4, etc. The version of the software package should also be indicated. In the example above, the data came from Release 9 of HyperChem.

For archival purposes, it is also possible to add the text that constitutes the input file for the calculation as well as the text of the output file. The publication data may have been extracted by parsing the output file or directly from the software package. In any event, the input and output file (if present) constitute additional archival data about the calculation that can be recovered later, if so desired.

Tags and Flags

Finally, the publication can indicate a set of arbitrary tags that can apply to the publication and provide an additional way to search for a set of publications. These tags are independent of searches based on SPARQL and are up to the authors to create. They may or may not be commonly used depending upon the preference of users.

A set of allowed values for flags is used to characterize publications. These are:

Status

Preliminary
Draft
Final

A publication can be considered preliminary (very raw data perhaps) or draft (not fully reviewed) or final. This status can be changed (edited) at any later time.

Visibility

Private
Protected
Public

Depending upon the desires of the authors, a publication can be set to private which means only the submitting author with a password can see the publication. Alternatively, a public publication can be seen by anyone with access to the portal. Finally, an intermediate visibility is available that is termed “protected”. A protected publication requires a key (similar to a password) that an author can pass to a collaborator so that they can see an author’s publication.
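The two flags and the visibility rules can be modelled as enumerations with a small access check. This is only a sketch of the rules as described here, not the portal's implementation; the function `can_view` is hypothetical, and it assumes that the submitting author can always see their own publication.

```python
from enum import Enum


class Status(Enum):
    PRELIMINARY = "Preliminary"
    DRAFT = "Draft"
    FINAL = "Final"


class Visibility(Enum):
    PRIVATE = "Private"
    PROTECTED = "Protected"
    PUBLIC = "Public"


def can_view(visibility: Visibility, is_submitting_author: bool, has_key: bool = False) -> bool:
    """Decide whether a viewer may see a publication.

    Assumption: the submitting author can always see their own publication.
    """
    if visibility is Visibility.PUBLIC:
        return True
    if visibility is Visibility.PROTECTED:
        # A collaborator needs the key handed over by an author.
        return is_submitting_author or has_key
    return is_submitting_author  # PRIVATE


print(Status("Draft"), can_view(Visibility("Protected"), False, has_key=True))
```

Looking a flag up by value, as in `Status("Draft")`, raises `ValueError` for anything outside the allowed set, which gives a simple validation step.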

Molecular System

The molecularSystem section of a CSX file is shown below:

csx_molsystem

A molecularSystem is a collection of molecules. Each molecule is a collection of residues (monomers) or simply of atoms. An intermediate level of “group” is also possible; that is, a molecule or residue may be divided into groups that are made up of atoms.

Normally a molecule is just a set of atoms, unless we are describing a protein or DNA-like structure where the residues would be amino acids or nucleotides. In the above example, there is only one molecule (water) made up of three atoms.

System

The molecular system currently has three properties. The traditional charge and multiplicity of a quantum calculation are two of these. In addition, we define the system temperature, as well, for statistical mechanical calculations such as molecular dynamics (MD), Monte Carlo, etc.

Molecule

Each molecule has an id which is the name or other identifier of the molecule. A default id of “m1”, “m2”, etc. for molecule 1, molecule 2, … is suggested.

A molecule has an atomCount describing the number of atoms in the molecule and an InChI key that is meant to be a unique identifier for each molecule in the system. The InChI string and InChI key are defined by IUPAC and may or may not exist for each molecule. The value “nil” or “” indicates that no InChI key was returned from IUPAC software. Each molecule is made up of atoms.

Atom

All atoms in a molecule must have an id attribute of the form “a1”, “a2”, “a3”, etc. These identifiers are used to describe the resulting bonds.

An atom has a number of children including the elementSymbol, elementName, etc.

elementSymbol – this is the normal symbol such as C, Cl, etc.
elementName – this is the full name for the element such as Carbon, Chlorine, etc.
atomName – for macromolecules an atom may have a pertinent name such as CA, CB for the alpha, beta carbons in a chain. For normal molecules, the default names are:

H
MainGroup
Metal
Row1TM
Row2TM
Row3TM
Lanthanide
Actinide
NobleGas

atomMass – the mass in amu
formalAtomCharge – integer charge such as +1 for the N in NH4+
calculatedAtomCharge – as used in molecular mechanics coulomb interactions
x/y/zCoord3D – the three Cartesian coordinates of an atom
basisSet – the basis set is a property of each atom and may be different for different atoms
coordination – describes the other atoms to which this atom is connected
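A reader for the atom children listed above could look roughly like this. The fragment is hypothetical: the namespace URI is a placeholder, atomCount is assumed to be an attribute of the molecule, the coordinate element names are a guess at how x/y/zCoord3D expands, and the water coordinates are illustrative values only.

```python
import xml.etree.ElementTree as ET

NS = {"cs": "http://example.org/cs#"}  # placeholder URI (assumption)

# Hypothetical water molecule; coordinates are illustrative values only.
SAMPLE = """<cs:molecule xmlns:cs="http://example.org/cs#" id="m1" atomCount="3">
  <cs:atom id="a1">
    <cs:elementSymbol>O</cs:elementSymbol>
    <cs:xCoord3D>0.0000</cs:xCoord3D>
    <cs:yCoord3D>0.0000</cs:yCoord3D>
    <cs:zCoord3D>0.1173</cs:zCoord3D>
  </cs:atom>
  <cs:atom id="a2">
    <cs:elementSymbol>H</cs:elementSymbol>
    <cs:xCoord3D>0.0000</cs:xCoord3D>
    <cs:yCoord3D>0.7572</cs:yCoord3D>
    <cs:zCoord3D>-0.4692</cs:zCoord3D>
  </cs:atom>
  <cs:atom id="a3">
    <cs:elementSymbol>H</cs:elementSymbol>
    <cs:xCoord3D>0.0000</cs:xCoord3D>
    <cs:yCoord3D>-0.7572</cs:yCoord3D>
    <cs:zCoord3D>-0.4692</cs:zCoord3D>
  </cs:atom>
</cs:molecule>
"""


def read_atoms(molecule: ET.Element) -> list:
    """Return (id, symbol, (x, y, z)) tuples and check the declared atomCount."""
    atoms = []
    for node in molecule.findall("cs:atom", NS):
        xyz = tuple(
            float(node.findtext(f"cs:{axis}Coord3D", namespaces=NS))
            for axis in "xyz"
        )
        symbol = node.findtext("cs:elementSymbol", namespaces=NS)
        atoms.append((node.get("id"), symbol, xyz))
    declared = int(molecule.get("atomCount"))
    if declared != len(atoms):
        raise ValueError(f"atomCount={declared} but {len(atoms)} atoms found")
    return atoms


atoms = read_atoms(ET.fromstring(SAMPLE))
for atom_id, symbol, xyz in atoms:
    print(atom_id, symbol, xyz)
```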

Bond

Bonds in CSX are a property of atoms. Each atom’s XML coordination element holds them as child bond elements. The bondCount attribute of coordination is the number of bonds that this atom participates in. Each bond is a child of the atom’s coordination with attributes id1 and id2 that describe the two connected atoms. This means that each bond is described twice – as a grandchild of the atom with attribute id1 and as a grandchild of the atom with attribute id2. The content of the XML bond element describes the bond as single, double, triple, aromatic, or dative.
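Because every bond is recorded under both of its atoms, a consumer typically wants to collapse the duplicates into one connection table. The sketch below does this with an unordered pair as the key, so it works whichever atom is written as id1. The fragment and namespace URI are assumptions; the helper `unique_bonds` is invented for this example.

```python
import xml.etree.ElementTree as ET

NS = {"cs": "http://example.org/cs#"}  # placeholder URI (assumption)

# Hypothetical water molecule with each bond listed under both atoms.
SAMPLE = """<cs:molecule xmlns:cs="http://example.org/cs#" id="m1">
  <cs:atom id="a1">
    <cs:coordination bondCount="2">
      <cs:bond id1="a1" id2="a2">single</cs:bond>
      <cs:bond id1="a1" id2="a3">single</cs:bond>
    </cs:coordination>
  </cs:atom>
  <cs:atom id="a2">
    <cs:coordination bondCount="1">
      <cs:bond id1="a1" id2="a2">single</cs:bond>
    </cs:coordination>
  </cs:atom>
  <cs:atom id="a3">
    <cs:coordination bondCount="1">
      <cs:bond id1="a1" id2="a3">single</cs:bond>
    </cs:coordination>
  </cs:atom>
</cs:molecule>
"""


def unique_bonds(molecule: ET.Element) -> dict:
    """Map each unordered atom pair to its bond type, dropping duplicates."""
    table = {}
    for atom in molecule.findall("cs:atom", NS):
        coordination = atom.find("cs:coordination", NS)
        if coordination is None:
            continue  # atoms without a connection table are allowed
        bonds = coordination.findall("cs:bond", NS)
        if int(coordination.get("bondCount", len(bonds))) != len(bonds):
            raise ValueError(f"bondCount mismatch on atom {atom.get('id')}")
        for bond in bonds:
            pair = frozenset((bond.get("id1"), bond.get("id2")))
            table[pair] = (bond.text or "").strip()
    return table


bond_table = unique_bonds(ET.fromstring(SAMPLE))
print(len(bond_table), "unique bonds")
```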

Molecular Calculation

Several types of calculation need to be represented in a CSX file, although they have much in common. In particular, SCF and DFT calculations share most of their structure. The following section of a CSX file describes a simple SCF calculation.

Ab Initio SCF Calculation

csx_scf

The first XML element describes the calculation as a quantum mechanical one (as opposed to a molecular mechanics one). Second, it uses a singleReferenceState (as opposed to a multipleReferenceState such as MCSCF). Third, the calculation uses a singleDeterminant (as opposed to a multipleDeterminant such as CISD). Finally, the calculation is characterized as abInitioSCF. Alternatives might be semiempiricalSCF or dft.

Attributes

The attributes of such a calculation are:

methodology – allows for deviations from the normal type of SCF calculation.
spinType – RHF, UHF, or ROHF
basisSet – as defined by the PNNL Basis Set Exchange
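One way to picture the classification and its attributes is as a chain of nested elements ending in abInitioSCF. The sketch below builds such a chain; that the levels are literally nested in this order is our reading of the description, and the namespace URI and the attribute values (`cs:normal`, `cs:RHF`, `bse:3-21G`) are illustrative assumptions rather than values taken from the schema.

```python
import xml.etree.ElementTree as ET

CS_NS = "http://example.org/cs#"  # placeholder URI (assumption)
ET.register_namespace("cs", CS_NS)


def cs(name: str) -> str:
    """Qualify a local name with the (placeholder) cs: namespace."""
    return f"{{{CS_NS}}}{name}"


def build_scf(spin_type: str, basis_set: str, methodology: str = "cs:normal") -> ET.Element:
    """Build a nested classification ending in an abInitioSCF element."""
    calculation = ET.Element(cs("molecularCalculation"))
    node = calculation
    for level in ("quantumMechanics", "singleReferenceState", "singleDeterminant"):
        node = ET.SubElement(node, cs(level))
    ET.SubElement(
        node,
        cs("abInitioSCF"),
        {"methodology": methodology, "spinType": spin_type, "basisSet": basis_set},
    )
    return calculation


calc = build_scf("cs:RHF", "bse:3-21G")
scf = calc.find(
    "cs:quantumMechanics/cs:singleReferenceState/cs:singleDeterminant/cs:abInitioSCF",
    {"cs": CS_NS},
)
print(scf.attrib)
```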

Child Elements of Calculation

The child elements of an SCF calculation are:

energies – core nuclear-nuclear interaction, electronic energy, total energy, etc.
properties – system properties and atom properties
waveFunction – orbital energies, orbital symmetry, coefficients, etc.

The energies are all labeled by their “type” attribute. For example, cs:totalPotential is a potential energy for nuclear motion in the Born-Oppenheimer approximation and is commonly just called the total energy in quantum calculations. The advantage of using a Uniform Resource Identifier (URI) here is that it gives uniqueness to the variable being described.

A large variety of properties (possibly expectation values) could be associated with a quantum calculation. We divide these properties into system properties like dipole moment and atom properties like mullikenCharges. A systemProperty has the attributes “name” and “unit” where name, for example, is cs:dipoleMomentX, denoting the X component of the total dipole moment, and unit is cs:debye.

An atomProperty also has attributes “name” and “unit” in addition to the attribute propertyCount, which indicates how many atoms follow with properties. Each of those per-atom entries carries the attributes moleculeId and atomId, which together uniquely identify the atom (atom ids are unique only within the molecule that the atom belongs to). The value of the atomProperty is the content of the associated XML property element.
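The energies and properties described above could be read back roughly as follows. The fragment is hypothetical: the namespace URI, the element names cs:energy and cs:property, the unit cs:electronCharge and all numeric values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"cs": "http://example.org/cs#"}  # placeholder URI (assumption)

# Hypothetical fragment; numbers are illustrative, not computed results.
SAMPLE = """<cs:abInitioSCF xmlns:cs="http://example.org/cs#">
  <cs:energies>
    <cs:energy type="cs:totalPotential">-75.585</cs:energy>
  </cs:energies>
  <cs:properties>
    <cs:systemProperty name="cs:dipoleMomentX" unit="cs:debye">0.0</cs:systemProperty>
    <cs:atomProperty name="cs:mullikenCharges" unit="cs:electronCharge" propertyCount="3">
      <cs:property moleculeId="m1" atomId="a1">-0.74</cs:property>
      <cs:property moleculeId="m1" atomId="a2">0.37</cs:property>
      <cs:property moleculeId="m1" atomId="a3">0.37</cs:property>
    </cs:atomProperty>
  </cs:properties>
</cs:abInitioSCF>
"""


def read_energies(calculation: ET.Element) -> dict:
    """Map each energy's type attribute to its numeric value."""
    return {
        node.get("type"): float(node.text)
        for node in calculation.findall("cs:energies/cs:energy", NS)
    }


def read_atom_property(calculation: ET.Element, name: str) -> dict:
    """Map (moleculeId, atomId) to the value of the named atom property."""
    for prop in calculation.findall("cs:properties/cs:atomProperty", NS):
        if prop.get("name") != name:
            continue
        values = {
            (node.get("moleculeId"), node.get("atomId")): float(node.text)
            for node in prop.findall("cs:property", NS)
        }
        if int(prop.get("propertyCount")) != len(values):
            raise ValueError(f"propertyCount mismatch for {name}")
        return values
    raise KeyError(name)


calculation = ET.fromstring(SAMPLE)
energies = read_energies(calculation)
charges = read_atom_property(calculation, "cs:mullikenCharges")
print(energies, charges)
```

Keying the per-atom values by the pair (moleculeId, atomId) reflects the point made above that atom ids alone are unique only within one molecule.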

With quantum calculations there may be no connection table identifying a “molecule” since a molecule of the molecular system is defined by CSX as the collection of atoms forming a connected graph. For certain quantum calculations there may only be a molecularSystem with child atoms and no “molecule” or “coordination”, “bond”, etc. This is acceptable as valid CSX. It is preferred, however, to define a connection table, if possible.

waveFunction

The wave function example below shows the result for the above 3-21G calculation with its orbitals, etc. The far right side of the display is cut off and not all orbital energies, symmetries, etc. are shown.

csx_wavefunc

The wave function has attributes orbitalCount and basisCount. These are generally the same, but in some calculations, such as those from GAMESS, they may differ because of the treatment of 5 spherical or 6 Cartesian d-orbitals.

The child XML elements of the waveFunction include the orbitalEnergies, cs:orbitalSymmetry, cs:orbitalOccupancy and the orbitals themselves. Each orbital is identified by its id.
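The 5-versus-6 d-orbital remark can be made concrete by counting basis functions per shell: a shell of angular momentum l has (l+1)(l+2)/2 Cartesian functions but 2l+1 spherical ones. The sketch below applies both counts to a hypothetical atom; how a given package then reports orbitalCount and basisCount is package-specific and is not modelled here.

```python
def shell_size(l: int, spherical: bool) -> int:
    """Number of functions in one shell of angular momentum l (0=s, 1=p, 2=d)."""
    return 2 * l + 1 if spherical else (l + 1) * (l + 2) // 2


def basis_count(shells: list, spherical: bool) -> int:
    """Total number of basis functions for a list of shell angular momenta."""
    return sum(shell_size(l, spherical) for l in shells)


# Hypothetical atom with three s shells, two p shells and one d shell.
shells = [0, 0, 0, 1, 1, 2]
cartesian_total = basis_count(shells, spherical=False)
spherical_total = basis_count(shells, spherical=True)
print(cartesian_total, spherical_total)
```

For s and p shells the two conventions agree, so the totals differ by exactly one function per d shell.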

Other Calculations

If the calculation is a density functional calculation, then cs:abinitioSCF is replaced by cs:dft and two new attributes appear – cs:exchangeFunctional and cs:correlationFunctional. If the calculation is MP2, then cs:abinitioSCF is replaced by cs:secondOrderMoellerPlesset and an energy with type cs:correlation is added. Other calculations such as CCD, CCSD(T), etc. may have energies and properties but not a cs:waveFunction.

If the calculation is a molecular mechanics calculation then cs:molecularMechanics replaces cs:quantumMechanics and new attributes describing the cs:forceField and cs:parameterSet appear.

Conclusion

The informal CSX file standard is a convenient way to encapsulate computational data coming from, for example, various quantum chemistry packages. It will make possible publication of the data onto the semantic web as well as portability of molecular structures and results among computational and other chemists. It is a beginning, and other options may become available, but we believe it is a valuable contribution.
