The chemical information ontology (CHEMINF) is implemented with the Web Ontology Language (OWL2). Classes in the ontology have identifiers of the form http://semanticscience.org/resource/CHEMINF_XXXXXX, and include labels (rdfs:label) and definitions (dc:description). The ontology is versioned using owl:versionInfo. All CHEMINF resources are Linked Data nodes, and their URIs are dereferenceable.
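As a small illustration of the identifier scheme, the following Python sketch builds and checks class IRIs. It assumes that the XXXXXX placeholder stands for a six-digit, zero-padded number; the helper names are hypothetical and are not part of CHEMINF.

```python
import re

# Namespace taken from the text; the six-digit local part is an assumption
# based on the XXXXXX placeholder.
CHEMINF_BASE = "http://semanticscience.org/resource/CHEMINF_"
_CHEMINF_IRI = re.compile(r"^" + re.escape(CHEMINF_BASE) + r"(\d{6})$")


def cheminf_iri(number: int) -> str:
    """Build a class IRI from a numeric identifier (hypothetical helper)."""
    if not 0 <= number <= 999999:
        raise ValueError("identifier must fit in six digits")
    return f"{CHEMINF_BASE}{number:06d}"


def parse_cheminf_iri(iri: str) -> int:
    """Return the numeric part of a CHEMINF class IRI, or raise ValueError."""
    match = _CHEMINF_IRI.match(iri)
    if match is None:
        raise ValueError(f"not a CHEMINF class IRI: {iri!r}")
    return int(match.group(1))
```

Because the URIs are dereferenceable, an IRI built this way could in principle be resolved over HTTP; the sketch deliberately stops at string handling.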
In terms of description logic expressivity, the ontology contains atomic concepts and roles, transitive roles, conjunction, disjunction, existential and value restriction, role hierarchies, inverse roles, number restrictions and datatypes. CHEMINF extends the Ontology for Biomedical Investigations (OBI), the Information Artifact Ontology (IAO), the Relation Ontology (RO) and the Basic Formal Ontology (BFO).
The CHEMINF ontology includes entities such as:
- Chemical graphs, and various formats for encoding them.
- Chemical descriptors, with definitions and axioms describing what they are specifically about.
- Specifications for certain descriptors.
- Algorithms and their software implementations and axioms describing their inputs and outputs.
- Chemical data representation formalisms and formats.
Additionally, we have identified a hierarchy of chemical qualities, which are needed to specify exactly which quality a chemical descriptor is describing. However, in keeping with the OBO Foundry's principle of orthogonality of ontology application domains, we have submitted these chemical quality terms to the Phenotype Quality ontology (PATO).
We explicitly exclude from the scope of CHEMINF:
- Actual chemical entities, parts, ions, groups, etc., which are included in the ChEBI ontology.
- Any aspects of protein or nucleotide sequence information, which are included in the Sequence Ontology.
- The algorithmic steps of named algorithms. We include the named algorithms themselves, and link from the definition to the relevant paper describing the algorithm where possible.
- Similarly, the detail of format specifications (such as Chemical Markup Language (CML)), for which we provide a citation rather than reproducing the specification.
The ontology is licensed as Creative Commons Share-Alike By Attribution and is freely available from the Google Code project site http://semanticchemistry.googlecode.com. To preserve modularity and ease of maintenance, the ontology consists of multiple files; for example, one separate file provides mappings to the Blue Obelisk Descriptor Ontology. These separate files are referenced from the primary ontology file cheminf.owl using the OWL import mechanism.
Ontology content and organisation
The accompanying figure provides a schematic overview of the content of the CHEMINF ontology. The basic content of the domain terminology can be divided into named descriptors, named algorithms which calculate descriptors, and software libraries which contain software modules that implement algorithms.
An overview of the content of the CHEMINF ontology.
Not illustrated in this diagram are the processual executions of the software implementations, as these fall within the hierarchy of processes rather than information entities. The link between processes which are software executions, and the software implementation that is executed, is that the process has the software as agent.
The key entities in our ontology, situated beneath their appropriate superclasses in the referenced ontologies, and including their number of subclasses, are given in the accompanying table. The most well-developed branch of the ontology is the chemical descriptor branch: we have already included 180 different descriptors, ranging from simple descriptors such as atom count to more complex descriptors such as topological polar surface area. Descriptors give information about qualities of chemical entities, and to formalise this association, we have added 50 chemical entity qualities to the quality branch of the ontology, including polarizability and relative permittivity. Format specifications, such as MDL molfile and the Chemical Markup Language (CML), are the next largest branch of the ontology. The algorithm and software implementation branches are not yet well developed, although there is an ongoing effort to include all algorithms and implementation details for the Chemistry Development Kit (CDK), which will be forthcoming in a future release.
Key entities in CHEMINF ontology and their immediate superclasses.
The key relations in our ontology are:
- An information content entity is about some entity; an entity is described by some information content entity. (We note that the inverse of the is about relation is considered problematic from an ontological perspective since information is not a property of the thing it is information about. However, we introduce this inverse relation is described by here as a convenient shorthand for referring back from an entity to information, where the aboutness is already captured in the ontology.) The IAO is about relationship is further specialized into different subrelations, of which one example is the is quality measurement of relationship, which relates a measured datum to the particular quality that it is a measurement of. While this relation is close to what we need in order to relate chemical information entities to properties of the chemicals that they are about, we allow chemical information that is both measured and calculated, and we therefore introduce a distinct relation is descriptor of.
- A chemical descriptor is descriptor of some specifically dependent continuant (quality or other property); a specifically dependent continuant has descriptor some chemical descriptor.
- An information content entity conforms to some directive information entity (i.e. specification); the directive information entity specifies an information content entity.
- An entity has attribute some data item; the data item is attribute of some entity.
Other relations which we make use of in the ontology include the bearer of relation, which links independent entities to the dependent entities (such as qualities) which inhere in them; the has part and part of mereological relations, which are inherited from the Relation Ontology; and the has value data relation, which links a data item to its value.
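The pairing of each relation with its inverse can be sketched in a few lines of Python. The relation names below are those given in the list of key relations; the lookup table and the helper function are illustrative assumptions, not part of the ontology or of any OWL library.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Relation/inverse pairs as named in the text. The dictionary is only an
# illustrative data structure for this sketch.
INVERSE_OF = {
    "is about": "is described by",
    "is descriptor of": "has descriptor",
    "conforms to": "specifies",
    "has attribute": "is attribute of",
    "has part": "part of",
}
# Make the lookup symmetric so either direction can be resolved.
INVERSE_OF.update({inverse: relation for relation, inverse in list(INVERSE_OF.items())})


def with_inverses(triples: Iterable[Triple]) -> Set[Triple]:
    """Return the given statements plus the inverse of each one."""
    result = set(triples)
    for subject, relation, obj in list(result):
        inverse = INVERSE_OF.get(relation)
        if inverse is not None:
            result.add((obj, inverse, subject))
    return result
```

For instance, asserting that a SMILES data item is about a molecule yields the shorthand statement that the molecule is described by that data item, which is the convenience the inverse relation is introduced for.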
We now discuss the ontology model in more detail for the specific topic areas of format specifications, chemical descriptors, and algorithms and implementations.
In cheminformatics, many information objects are created in order to standardise or specify formats for data exchange or the operational requirements of a particular procedure. These information objects have a kind of normative content, creating – in their information content – a requirement on the information objects that conform to them. We model this type of information object as directive information entity.
Definition 3 A directive information entity is an information content entity that explicitly states essential attributes/requirements for a product or procedure, and may also be used to determine that the product/procedure meets its requirements/attributes.
One special kind of directive information entity is that which specifies the format for the encoding of information such that it can be encoded and decoded in a standard way. This is a data format specification.
Definition 4 A data format specification provides directives regarding the syntax of information such that it can be encoded and decoded in a standard fashion.
Some examples of data format specifications are the MOL and SD file format specifications commonly used for chemical graph storage and exchange, the Simplified Molecular Input Line Entry Specification (SMILES) format specification, and basic data format specifications such as integer or numeric, which are associated with the input parameters of algorithms, as illustrated in the accompanying figure.
Textual and numeric data format specifications.
These data formats are then used in the definition of different types of chemical descriptors.
The most general type of chemical information entity is that which captures some sort of data about some chemical entity. We use the term chemical descriptor.
Definition 5 A chemical descriptor is a data item (a quantity or value) whose syntax and semantics conform to some data format specification and which provides information about chemical entities including, but not limited to, reactions, substances, molecular entities, and their parts (rings, atoms, bonds, etc.).
Note that the term ‘descriptor’ has a narrower meaning in some cheminformatics communities, i.e. restricted in use to only those descriptors which have numeric values and which can be used in quantitative structure-activity-relationship models. For these types of descriptor, we propose the subtypes ‘numerical chemical descriptor’ which is defined in terms of the data type of the descriptor, and ‘QSAR chemical descriptor’ which is described in terms of the applicable usage of the descriptor.
Chemical descriptors may enumerate material or processual parts, quantify qualities or realizables including dispositional probabilities. For example, a SMILES descriptor, which conforms to the SMILES specification for unambiguously describing molecular structure using short ASCII strings, can be created for aspirin (acetylsalicylic acid, CHEBI:15365) with value CC(
The following example shows, in Manchester OWL syntax, some descriptors (SMILES, InChI and InChIKey) associated with aspirin (acetylsalicylic acid) using CHEMINF:
Class: ‘acetylsalicylic acid’
Individual: ‘acetylsalicylic acid InChI’
Individual: ‘acetylsalicylic acid InChIKey’
‘has value’ “InChIKey
Individual: ‘acetylsalicylic acid SMILES’
‘has value’ “CC(
Descriptors for chemical entities often describe aspects of the structure of chemical entities. Structural descriptors have the additional property that, while remaining within the rules of the structural representation formalism, they cannot change value without representing a different entity. To put this differently: chemists cannot have a meeting and decide to give a different structural descriptor to a particular chemical entity, as they can for a name. The structural descriptor is constrained by the format specification and the structure being described. Chemists could, of course, decide to change the format specification, and many new structural descriptors are born through the invention and specification of new formats. Note that in many cases, the structure of a chemical entity may not be known at the time that the chemical is named. In other cases, a structure is presented but is later found to be incorrect, and needs to be revised throughout public databases. Chemical entities are therefore not identical with their structural representations (such as chemical graphs). Indeed, structural representations give a static view of the nature of chemical structures, which is an approximation to the actual dynamic reality.
Our model allows the explicit linking of a descriptor not only to the kind of entity it is about (such as a molecule), but also to the particular property of that entity that the descriptor represents. For example, a charge descriptor is descriptor of the electrical charge quality of a molecule. In this way, descriptors can be grouped together based on the nature of the properties that they describe. However, there are some descriptors for which the exact molecular property being described is unclear; in these cases we remain agnostic and make no assertion beyond the claim that the descriptor is about the molecule, with the possibility to pick out those specific attributes which formed the input to the descriptor calculation.
The accompanying figure illustrates the CHEMINF ontology model for chemical descriptors. Chemical descriptors are data items which are about chemical entities. They conform to a chemical data format specification, and they are descriptors of a property (quality or realizable) which inheres in a chemical entity.
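The descriptor model described here can be mirrored in a minimal Python sketch. The field names follow the relations named in the text; the class, the grouping helper, the example individuals for water and the "molecular structure" label are all hypothetical and serve only to show how descriptors can be grouped by the property they describe, with an empty slot where we remain agnostic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ChemicalDescriptor:
    """A data item about a chemical entity; fields mirror the relations."""
    label: str
    is_about: str                           # the chemical entity described
    conforms_to: str                        # a data format specification
    has_value: str                          # the literal value of the data item
    is_descriptor_of: Optional[str] = None  # property described, if known


def group_by_property(
    descriptors: Iterable[ChemicalDescriptor],
) -> Dict[Optional[str], List[str]]:
    """Group descriptor labels by the property they describe.

    Descriptors with no asserted property end up under the key None:
    for them only the 'is about' link is stated.
    """
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for descriptor in descriptors:
        groups[descriptor.is_descriptor_of].append(descriptor.label)
    return dict(groups)


# Toy individuals for water; labels and values are illustrative only.
water_smiles = ChemicalDescriptor(
    "water SMILES", "water", "SMILES format specification", "O",
    is_descriptor_of="molecular structure",
)
water_charge = ChemicalDescriptor(
    "water charge descriptor", "water", "integer", "0",
    is_descriptor_of="electrical charge",
)
water_opaque = ChemicalDescriptor(
    "water opaque descriptor", "water", "numeric", "0.42",
)
groups = group_by_property([water_smiles, water_charge, water_opaque])
```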
Chemical descriptors can be obtained from physical experiments, in which something is quantitatively measured. We say that an experiment involves some chemical substance as input and produces some chemical data as output. For instance, the structure of a chemical substance can be investigated using nuclear magnetic resonance (NMR); this requires as input some chemical substance in buffered solvent within some concentration range and produces as output resonance frequencies.
On the other hand, descriptor values can be generated in silico from the analysis of computational representations of chemical entities by software applications.
Algorithms and software implementations
When using software to predict chemical attributes, software applications consume some kind of data and produce some kind of data. Software, modules and methods are expressed as source code using programming languages that are subsequently compiled into a machine interpretable format. These software methods are specified by one or more algorithms, or sequences of steps. Like format specifications, algorithms are directive information entities.
Definition 6 An algorithm is a directive information entity that consists of a finite sequence of instructions to accomplish a task, which may be expressed in pseudocode, textual description, or a process flow diagram.
Chemical descriptors are distinguished from the algorithms which generate them, although in many cases they share a common name, since algorithms specify procedural information, while descriptors are declarative information. In some cases, the same descriptor can be calculated by several different algorithms.
Named algorithms may have different versions. For example, the Kabsch algorithm for calculating the optimal rotation matrix for alignment of two chemical structures was first presented in one publication, and a correction was presented in a later one. In this case it can be said that there are two versions of the Kabsch algorithm, and it is useful to distinguish these in implementations. To model this scenario in CHEMINF, we create a superclass for the named algorithm and create subclasses for each of the versions. Where it is known which version is implemented in a particular library, the implementation can be annotated to the versioned subclass; where it is not known, the annotation to the parent class can be used instead.
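The versioning pattern can be sketched with ordinary Python subclassing standing in for OWL subclass axioms. The class names, the version keys and the helper are hypothetical; CHEMINF itself expresses this with ontology classes rather than program classes.

```python
from typing import Optional


class KabschAlgorithm:
    """Superclass for the named algorithm, independent of version."""


class KabschAlgorithmOriginal(KabschAlgorithm):
    """The version as first presented (hypothetical class name)."""


class KabschAlgorithmCorrected(KabschAlgorithm):
    """The version including the later correction (hypothetical class name)."""


# Hypothetical version keys used only in this sketch.
VERSIONS = {
    "original": KabschAlgorithmOriginal,
    "corrected": KabschAlgorithmCorrected,
}


def annotation_target(version: Optional[str] = None) -> type:
    """Pick the class an implementation should be annotated to.

    A known version maps to its versioned subclass; an unknown version
    (None) falls back to the parent class for the named algorithm.
    """
    if version is None:
        return KabschAlgorithm
    return VERSIONS[version]
```

Annotating to the parent class loses no correctness: every versioned subclass is still a Kabsch algorithm, so queries against the parent class retrieve implementations of either version.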
Algorithms are also distinguished from the software which implements the algorithms. This is because it is possible for an implementation to contain errors, or to be more or less faithful to the designed algorithm which it implements. Programming languages have different constructs and performance profiles which lead to subtle differences in different implementations of the same algorithm. For this reason, to correctly associate provenance with calculated descriptor values, we suggest at minimum the annotation of calculated values to the software implementation rather than directly to the algorithm, and preferably with detail about the fully specified process execution, as discussed below in the context of data transformation operations.
Software implementations can be stand-alone single software methods, or they can be packaged into software libraries. For example, the Chemistry Development Kit (CDK) is a software library containing a wide collection of modules for manipulating chemical information. Software implementations are associated with a programming language; we say that the implementation has agent the programming language.
Definition 7 A software implementation is a machine-executable set of instructions in some programming language. Software implementations generally belong to some named library, which is a collection of related software modules. Individually executable methods or components of a software implementation take input parameters, execute some operations using such input values, and produce some output parameters.
Considering the software maintenance lifecycle, most software implementations are continuously evolving. Different versions of software arise from this maintenance cycle, each being a different manifestation of the relevant source code, in that they are variants of each other.
The accompanying figure shows the CHEMINF object model for algorithms and software implementations. Algorithms have specified output a particular chemical descriptor. A software module, which consists of one or more software methods, conforms to an algorithm. Each software method has zero or more input parameters and zero or more output data items (which may themselves become parameters as input to another software method). In addition to having data items as output, a software method may also raise software messages as output, for example error or warning messages.
Algorithms and software implementations.
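The object model for algorithms and software implementations can be rendered as a minimal Python sketch. All class and field names are assumptions chosen to echo the relations in the text (specified output, conforms to, input parameters, output data items, software messages); the atom-counting method is a toy stand-in, not CDK code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Algorithm:
    name: str
    specified_output: str  # the chemical descriptor the algorithm yields


@dataclass
class MethodResult:
    outputs: Dict[str, object] = field(default_factory=dict)  # output data items
    messages: List[str] = field(default_factory=list)         # errors or warnings


@dataclass
class SoftwareMethod:
    name: str
    input_parameters: Tuple[str, ...]   # zero or more parameter names
    body: Callable[..., MethodResult]   # the executable part of the method

    def execute(self, **inputs: object) -> MethodResult:
        """Run the method, raising a software message for missing inputs."""
        missing = [p for p in self.input_parameters if p not in inputs]
        if missing:
            return MethodResult(messages=[f"error: missing input {p}" for p in missing])
        return self.body(**inputs)


@dataclass
class SoftwareModule:
    name: str
    conforms_to: Algorithm          # the algorithm the module implements
    methods: List[SoftwareMethod]   # one or more software methods


def count_atoms(symbols: List[str]) -> MethodResult:
    """Toy method: count the atoms in a list of element symbols."""
    result = MethodResult(outputs={"atom count": len(symbols)})
    if not symbols:
        result.messages.append("warning: empty atom list")
    return result


atom_count_algorithm = Algorithm("atom count algorithm", specified_output="atom count")
atom_count_method = SoftwareMethod("count_atoms", ("symbols",), count_atoms)
module = SoftwareModule("toy atom counter", atom_count_algorithm, [atom_count_method])
```

Keeping messages separate from output data items reflects the distinction drawn in the model: a warning is an output of the method, but it is not itself a descriptor value.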
When software is actually executed within some pipeline or towards some objective, its execution is a process. The outcome of this process depends on many factors in the execution environment, including the hardware platform on which the process is executed and the operating system and other supporting libraries which are installed on that platform. For example, a data transformation operation is the execution of a software module with specific parameters as inputs.
Definition 8 A data transformation is a planned process that realizes some agent-specified objective. It requires the software which is being executed and the hardware on which it is executed as participants, and may require data items as input, and may produce data items as output.
The accompanying figure shows the CHEMINF ontology model for data transformation operations. Since the behaviour of a data transformation operation is often dependent on the value of the input parameters, for full metadata about calculated values it is important to associate them with the fully specified process execution.
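One way to picture a fully specified process execution is as a provenance record captured at the moment a calculation runs. The sketch below is an assumption about what such a record might hold (software, version, hardware, operating system, inputs, outputs); the record type and the wrapper function are hypothetical and do not correspond to CHEMINF terms one to one.

```python
import platform
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class DataTransformationRecord:
    """Metadata about one execution of a software implementation."""
    software: str
    software_version: str
    hardware: str            # machine type reported by the platform
    operating_system: str
    inputs: Dict[str, Any]   # input data items, by parameter name
    outputs: Dict[str, Any]  # output data items


def run_with_provenance(
    func: Callable[..., Any],
    software: str,
    software_version: str,
    **inputs: Any,
) -> DataTransformationRecord:
    """Execute func on the inputs and record the execution context."""
    outputs = {"result": func(**inputs)}
    return DataTransformationRecord(
        software=software,
        software_version=software_version,
        hardware=platform.machine(),
        operating_system=platform.system(),
        inputs=dict(inputs),
        outputs=outputs,
    )
```

A calculated value stored together with such a record can later be traced back to the exact parameters and environment that produced it, rather than only to the algorithm's name.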
Classification of entities within the ontology
In the CHEMINF ontology, we create different axes of classification, such as the axis of classification based on the type of entity that a descriptor is about, through the use of defined classes, i.e. classes which are fully logically defined through the specification of necessary and sufficient conditions. These logical definitions allow the use of a reasoner to compute subsumption (classification) beneath differently defined parent classes. This avoids the need to maintain separate classification hierarchies by hand in order for the result to include classification along multiple possibly orthogonal axes.
Examples of classes which we have defined using necessary and sufficient conditions in this fashion are chemical substance descriptor and molecular entity descriptor, which are defined as those descriptors which are about chemical substances or molecular entities respectively. Note that a chemical substance is a bulk collection of molecular entities, such as a portion of water compared to an individual water molecule.
The accompanying figure shows an extract from our ontology before and after the reasoner has performed a classification task, illustrating the calculated subsumption relationships. The class ‘chemical substance descriptor’ has no asserted children, but after reasoning, the children are inferred based on the information encoded for each descriptor in the ontology.
Automatic classification based on logical definitions.
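The effect of classification by defined classes can be imitated with a toy subsumption check in Python. This is not how an OWL reasoner works internally; it only shows the idea that a descriptor is placed beneath a defined class when the entity it is about falls under the class's filler. The entity hierarchy below the two root labels and the 'is about' assignments are hypothetical.

```python
from typing import Dict, Iterator, List, Optional

# Toy entity hierarchy (child -> parent). Only the two root labels are
# taken from the text; the children are assumptions for illustration.
ENTITY_PARENT: Dict[str, Optional[str]] = {
    "molecular entity": None,
    "chemical substance": None,
    "molecule": "molecular entity",
    "solution": "chemical substance",
}

# Defined classes: "X descriptor" is read as "descriptor that is about some X".
DEFINED_CLASSES: Dict[str, str] = {
    "molecular entity descriptor": "molecular entity",
    "chemical substance descriptor": "chemical substance",
}

# Asserted 'is about' target for each descriptor class (hypothetical).
ASSERTED_ABOUT: Dict[str, str] = {
    "atom count": "molecule",
    "SMILES descriptor": "molecular entity",
    "melting point descriptor": "chemical substance",
    "pH descriptor": "solution",
}


def self_and_ancestors(entity: str) -> Iterator[str]:
    """Yield the entity class followed by each of its superclasses."""
    current: Optional[str] = entity
    while current is not None:
        yield current
        current = ENTITY_PARENT.get(current)


def classify() -> Dict[str, List[str]]:
    """Infer the children of each defined class from the 'is about' links."""
    inferred: Dict[str, List[str]] = {name: [] for name in DEFINED_CLASSES}
    for descriptor, target in ASSERTED_ABOUT.items():
        reachable = set(self_and_ancestors(target))
        for defined, filler in DEFINED_CLASSES.items():
            if filler in reachable:
                inferred[defined].append(descriptor)
    return {name: sorted(children) for name, children in inferred.items()}
```

No descriptor is listed by hand under either defined class; adding a new descriptor with an 'is about' link is enough for it to appear beneath the right parent on the next classification run.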