Any chemical compound naturally occurring in living organisms can be called a ‘biochemical compound’. Biochemical compounds can be classified according to their structure, physico-chemical properties or biological function. Most biologists conveniently divide all biochemical compounds into ‘biopolymers’, which consist of macromolecules, and the rest, which consist of ‘small molecules’ (see, for instance, the MetaCyc Taxonomy of Compounds (1)). This dichotomy is faithfully mirrored by ‘traditional’ bioinformatics in the sense that information-rich macromolecules live in their own databases, such as the EMBL Nucleotide Sequence Database (2) and UniProt (3), independently from all other molecules, whether small or large.
Although ‘small molecules’ appear to be less complex entities than macromolecules, their naming, citation and representation in databases are not trivial tasks. Most genetically encoded biomacromolecules are easily represented as one-dimensional (1D) strings, while a two-dimensional (2D) sketch remains the most adequate portrait of a ‘small molecule’. Several linear notations have been developed, e.g. SMILES (4). However, a linear notation, like any other core structural data, cannot be used in speech (and should not be used in free text). Good annotation practice for biological databases is to use either consistent and widely recognized terminology or unique identifiers (to look up the molecule of interest in a dedicated database) (5).
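The distinction between a linear notation and an identifier can be illustrated with a minimal sketch. The registry, the `EX:` accessions and the function names below are hypothetical and do not correspond to any real database; only the SMILES strings themselves are standard notation for the named compounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallMolecule:
    identifier: str  # made-up accession, for illustration only
    name: str        # widely recognized name, suitable for free text
    smiles: str      # linear notation: structural data, not for free text


# Hypothetical in-memory stand-in for a dedicated small-molecule database.
REGISTRY = {
    m.identifier: m
    for m in (
        SmallMolecule("EX:0001", "water", "O"),
        SmallMolecule("EX:0002", "ethanol", "CCO"),
        SmallMolecule("EX:0003", "benzene", "c1ccccc1"),
    )
}


def annotate(identifier: str) -> str:
    """Return the text an annotator would write: a name plus its identifier."""
    molecule = REGISTRY[identifier]
    return f"{molecule.name} ({molecule.identifier})"


def structure_of(identifier: str) -> str:
    """Look up the structure (as SMILES) behind an identifier."""
    return REGISTRY[identifier].smiles


if __name__ == "__main__":
    print(annotate("EX:0002"))
    print(structure_of("EX:0002"))
```

In this sketch the annotation carries only the name and the identifier, while the structure is retrieved on demand from the registry, which is the separation the annotation practice above recommends.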
It is an unfortunate fact that chemical data has long been neglected by the computational biology/bioinformatics community. In publications, it is almost never featured as something worthy of attention on its own, but only in conjunction with one or another ‘omics’ or as part of a ‘data integration’ project. We consider this approach to be fundamentally flawed and believe that an open-access, good-quality resource for chemical entities or chemical reactions has value in its own right, not just in the context of metabolic pathways or protein ligands. To address this issue, a project was initiated in 2002 at the European Bioinformatics Institute (EBI) to create a definitive, freely available dictionary of Chemical Entities of Biological Interest (ChEBI; pronounced /ˈkebɪ/). The primary motivation was to provide a high-quality, thoroughly annotated controlled vocabulary to promote the correct and consistent use of unambiguous biochemical terminology throughout the molecular biology databases at the EBI. However, it became clear that this aim could not be achieved outside of a wider context, namely that of general chemistry and chemical nomenclature. Since its first public release (21 July 2004), ChEBI has grown to represent more than 12 000 molecular entities, groups and classes. The term ‘molecular entity’ refers to any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity (6). A group is a defined linked collection of atoms or a single atom within a molecular entity (7). ChEBI includes classes of molecular entities (e.g. ‘alkanes’) as well as classes of groups (e.g. ‘alkyl groups’). The scope of ChEBI encompasses not only ‘biochemical compounds’ but also pharmaceuticals, agrochemicals, laboratory reagents, isotopes and subatomic particles.