While chemical databases can be queried using the InChI string and InChIKey (IK), the latter was designed for open-web searching. It is becoming increasingly effective for this as more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 seconds. These include most major databases, with a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets. Image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures, as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching.
However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.
InChI; InChIKey; Databases; Google; Chemical structures; Patents; PubChem; ChemSpider
One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and which are predictors for a given property. There are several automated feature selection methods, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that predicts a given property accurately, efficiently and robustly, since the presence of irrelevant or redundant features can cause poor generalization capacity. In this paper an alternative selection method, which uses Random Forests to determine variable importance, is proposed in the context of QSPR regression problems, with an application to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.
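The sequential introduction of ranked variables can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ranked list is assumed to come from some importance measure (such as a Random Forest), and `cv_error` is a hypothetical callback standing in for a cross-validated model error (for example, the RMSE of a support vector machine). The toy error function and its numbers are invented purely to make the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple


def best_ranked_prefix(
    ranked_features: Sequence[str],
    cv_error: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add features one at a time in ranked order and keep the best prefix."""
    best_subset: List[str] = []
    best_error = float("inf")
    current: List[str] = []
    for name in ranked_features:
        current.append(name)
        error = cv_error(current)
        if error < best_error:
            best_subset, best_error = list(current), error
    return best_subset, best_error


def toy_cv_error(subset: List[str]) -> float:
    """Hypothetical stand-in for a cross-validated RMSE."""
    useful = {"d1": 5.0, "d2": 3.0, "d3": 1.0}
    gain = sum(useful.get(f, 0.0) for f in subset)
    penalty = 0.5 * sum(1 for f in subset if f not in useful)
    return 10.0 - gain + penalty


# Descriptor names are placeholders; the list is assumed to be pre-ranked.
subset, error = best_ranked_prefix(
    ["d1", "d2", "d3", "noise1", "noise2"], toy_cv_error
)
```

In this toy setting the error stops improving once the uninformative descriptors are added, so the selected prefix ends at the third descriptor.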
The model generalizes well even with a high dimensional dataset and in the presence of highly correlated variables. The feature selection step yielded lower prediction errors, with RMSE values 23% lower than without feature selection, while using only 6% of the total number of variables (89 of the original 1485). The proposed approach further compared favourably with other feature selection methods and dimension reduction of the feature space. The predictive model was selected using a 10-fold cross validation procedure. After selection, it was validated with an independent set to assess its performance on new data; the results were similar to those obtained for the training set, supporting the robustness of the proposed approach.
The proposed methodology seemingly improves the prediction performance of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors, providing faster and more cost-effective calculation of descriptors by reducing their numbers, and providing a better understanding of the underlying relationship between the molecular structure represented by descriptors and the property of interest.
Feature selection; Variable importance; High dimensional data; Random forests; Data-mining; Property prediction; QSPR; Hybrid methodology
Solvation free energy is a fundamental thermodynamic quantity that should be determined to estimate various physicochemical properties of a molecule and the desolvation cost for its binding to macromolecular receptors. Here, we propose a new solvation free energy function through the improvement of the solvent-contact model, and test its applicability in estimating the solvation free energies of organic molecules with varying sizes and shapes. This new solvation free energy function is constructed by combining the existing solute-solvent interaction term with the self-solvation term that reflects the effects of intramolecular interactions on solvation. Four kinds of atomic parameters should be determined in this solvation model: atomic fragmental volume, maximum atomic occupancy, atomic solvation, and atomic self-solvation parameters. All of these parameters, for a total of 37 atom types, are optimized using a standard genetic algorithm so as to minimize the difference between the experimental solvation free energies and those calculated by the solvation free energy function for 362 organic molecules. The solvation free energies estimated from the new solvation model compare well with the experimental results, with squared correlation coefficients of 0.88 and 0.85 for the training and test sets, respectively. The present solvation model is thus expected to be useful for estimating the solvation free energies of organic molecules.
Solvation free energy; Self-solvation; Solvent-contact model; Genetic algorithm; Atomic parameters
Since its public introduction in 2005, the IUPAC InChI chemical structure identifier standard has become the international standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and disseminators of chemical and related scientific offerings in manuscripts and databases.
In order to better understand the structural features of natural compounds from traditional Chinese medicines, the scaffold architectures of drug-like compounds in MACCS-II Drug Data Report (MDDR), non-drug-like compounds in Available Chemical Directory (ACD), and natural compounds in Traditional Chinese Medicine Compound Database (TCMCD) were explored and compared.
First, the different scaffolds were extracted from ACD, MDDR and TCMCD by using three scaffold representations, including Murcko frameworks, Scaffold Tree, and ring systems with different complexity and side chains. Then, by examining the accumulative frequency of the scaffolds in each dataset, we observed that the Level 1 scaffolds of the Scaffold Tree offer advantages over the other scaffold architectures to represent the scaffold diversity of the compound libraries. By comparing the similarity of the scaffold architectures presented in MDDR, ACD and TCMCD, structural overlaps were observed not only between MDDR and TCMCD but also between MDDR and ACD. Finally, Tree Maps were used to cluster the Level 1 scaffolds of the Scaffold Tree and visualize the scaffold space of the three datasets.
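The accumulative scaffold frequency analysis can be illustrated with a short sketch. It assumes that each compound has already been reduced to a scaffold label by some external tool (the labels below are hypothetical), and it simply reports the fraction of compounds covered by the k most frequent scaffolds; it is not the code used in the study.

```python
from collections import Counter
from typing import List, Sequence


def cumulative_scaffold_coverage(scaffolds: Sequence[str]) -> List[float]:
    """Fraction of compounds covered by the k most frequent scaffolds, k = 1..n."""
    counts = Counter(scaffolds)
    total = len(scaffolds)
    coverage: List[float] = []
    running = 0
    for _, count in counts.most_common():
        running += count
        coverage.append(running / total)
    return coverage


# One scaffold label per compound (hypothetical labels).
library = ["benzene", "benzene", "benzene", "pyridine", "indole", "pyridine"]
curve = cumulative_scaffold_coverage(library)
```

A curve that rises steeply indicates a library dominated by a few scaffolds, while a flatter curve suggests higher scaffold diversity.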
The analysis of the scaffold architectures of MDDR, ACD and TCMCD shows that, on average, drug-like molecules in MDDR have the highest diversity while natural compounds in TCMCD have the highest complexity. According to the Tree Maps, it can be observed that the Level 1 scaffolds present in MDDR have higher diversity than those presented in TCMCD and ACD. However, some representative scaffolds in MDDR with high frequency show structural similarities to those in TCMCD and ACD, suggesting that some scaffolds in TCMCD and ACD may be potentially drug-like fragments for fragment-based and de novo drug design.
Scaffold; Drug-likeness; Traditional Chinese medicines; Murcko frameworks; Scaffold tree; Tree maps
The Online Chemical Modeling Environment (OCHEM, http://ochem.eu) is a web-based platform that provides tools for automation of typical steps necessary to create a predictive QSAR/QSPR model. The platform consists of two major subsystems: a database of experimental measurements and a modeling framework. So far, OCHEM has been limited to the processing of individual compounds. In this work, we extended OCHEM with a new ability to store and model properties of binary non-additive mixtures. The developed system is publicly accessible, meaning that any user on the Web can store new data for binary mixtures and develop models to predict their non-additive properties.
The database already contains almost 10,000 data points for the density, bubble point, and azeotropic behavior of binary mixtures. For these data, we developed models for both qualitative (azeotrope/zeotrope) and quantitative endpoints (density and bubble points) using different learning methods and specially developed descriptors for mixtures. The prediction performance of the models was similar to or more accurate than results reported in previous studies. Thus, we have developed and made publicly available a powerful system for modeling mixtures of chemical compounds on the Web.
PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.
The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (STST-opt) and combo-Tanimoto (ComboTST-opt), and color-optimized color-Tanimoto (CTCT-opt) and combo-Tanimoto (ComboTCT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 ± 0.12 Å), and decreasing the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores (by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively).
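As a rough guide to the similarity measures named above, the sketch below assumes the usual overlap-based Tanimoto definition (overlap of A and B divided by the sum of the self-overlaps minus the mutual overlap) and assumes that a combo score is the sum of a shape and a color Tanimoto. Under these assumptions a combo score ranges from 0 to 2, which is why combo values can exceed 1. The function names and input numbers are illustrative only.

```python
def overlap_tanimoto(self_a: float, self_b: float, overlap_ab: float) -> float:
    """Tanimoto from self-overlaps of A and B and their mutual overlap."""
    denominator = self_a + self_b - overlap_ab
    if denominator <= 0:
        raise ValueError("overlap values must give a positive denominator")
    return overlap_ab / denominator


def combo_tanimoto(shape_tanimoto: float, color_tanimoto: float) -> float:
    """Combined score, assumed here to be the plain sum (range 0 to 2)."""
    return shape_tanimoto + color_tanimoto


# Illustrative overlap volumes (arbitrary units, not taken from the study).
shape_score = overlap_tanimoto(100.0, 80.0, 60.0)
combo_score = combo_tanimoto(shape_score, 0.25)
```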
This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.
Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds.
In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships.
A rule-based definition of chemical classes is proposed that allows chemical compound classes to be defined more precisely than before. The proposed structure-based reasoning logic allows chemistry expert knowledge to be translated into a computer-interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.
InChIKey is a 27-character compacted (hashed) version of InChI which is intended for Internet and database searching/indexing and is based on an SHA-256 hash of the InChI character string. The first block of InChIKey encodes molecular skeleton while the second block represents various kinds of isomerism (stereo, tautomeric, etc.). InChIKey is designed to be a nearly unique substitute for the parent InChI. However, a single InChIKey may occasionally map to two or more InChI strings (collision). The appearance of collision itself does not compromise the signature as collision-free hashing is impossible; the only viable approach is to set and keep a reasonable level of collision resistance which is sufficient for typical applications.
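The block structure described above can be made concrete with a small parser. The sketch assumes the commonly documented layout of a 14-letter first block, an 8-letter second-block hash followed by a standard/non-standard flag and a version letter, and a final protonation letter; the field names are chosen here for illustration. The sample key is the widely quoted standard InChIKey for ethanol, used purely as input data.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 + 1 + 1 letters, hyphen, 1 letter = 27 characters.
_INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # first block: molecular skeleton hash
    isomer_hash: str   # second block: stereo, isotopic and related layers
    flag: str          # assumed: 'S' for standard, 'N' for non-standard
    version: str       # assumed: InChI version letter
    protonation: str   # assumed: protonation indicator


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, raising ValueError if malformed."""
    match = _INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
parts = split_inchikey(ethanol_key)
```

Searching on `parts.skeleton` alone matches every key that shares the same first block, regardless of the isomerism encoded in the second block.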
We tested, in computational experiments, how well the real-life InChIKey collision resistance corresponds to the theoretical estimates expected by design. For this purpose, we analyzed the statistical characteristics of InChIKey for datasets of variable size in comparison to the theoretical statistical frequencies. For the relatively short second block, an exhaustive direct testing was performed. We computed and compared to theory the numbers of collisions for the stereoisomers of Spongistatin I (using the whole set of 67,108,864 isomers and its subsets). For the longer first block, we generated, using custom-made software, InChIKeys for more than 3 × 10^10 chemical structures. The statistical behavior of this block was tested by comparison of experimental and theoretical frequencies for the various four-letter sequences which may appear in the first block body.
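A back-of-the-envelope estimate shows what kind of theoretical figure such tests are compared against. The sketch below uses the standard birthday-problem count of colliding pairs for an ideal uniform hash. The 37-bit width of the second-block hash is an assumption taken from the general InChIKey documentation rather than from this text, and the study's own theoretical treatment may differ in detail.

```python
def expected_colliding_pairs(n_items: int, hash_bits: int) -> float:
    """Expected number of colliding pairs for n items under an ideal hash.

    Each of the n*(n-1)/2 pairs collides with probability 2**-hash_bits.
    """
    pairs = n_items * (n_items - 1) // 2
    return pairs / 2 ** hash_bits


# 67,108,864 stereoisomers (equal to 2**26) hashed into an assumed 37-bit space.
n_isomers = 67_108_864
estimate = expected_colliding_pairs(n_isomers, 37)
```

Under these assumptions the estimate is close to 2^14, that is, roughly sixteen thousand colliding pairs across the full isomer set.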
From the results of our computational experiments we conclude that the observed characteristics of InChIKey collision resistance are in good agreement with theoretical expectations.
Although many consensus clustering methods have been successfully used for combining multiple classifiers in many areas such as machine learning, applied statistics, pattern recognition and bioinformatics, few consensus clustering methods have been applied for combining multiple clusterings of chemical structures. It is known that any individual clustering method will not always give the best results for all types of applications. So, in this paper, three voting and graph-based consensus clusterings were used for combining multiple clusterings of chemical structures to enhance the ability of separating biologically active molecules from inactive ones in each cluster.
The cumulative voting-based aggregation algorithm (CVAA), cluster-based similarity partitioning algorithm (CSPA) and hyper-graph partitioning algorithm (HGPA) were examined. The F-measure and Quality Partition Index method (QPI) were used to evaluate the clusterings and the results were compared to Ward’s clustering method. The MDL Drug Data Report (MDDR) dataset was used for experiments and was represented by two 2D fingerprints, ALOGP and ECFP_4. The voting-based consensus clustering method outperformed Ward’s method under both the F-measure and QPI for both ALOGP and ECFP_4 fingerprints, while the graph-based consensus clustering methods outperformed Ward’s method only for ALOGP using QPI. The Jaccard and Euclidean distance measures were the methods of choice to generate the ensembles, giving the highest values for both criteria.
The results of the experiments show that consensus clustering methods can improve the effectiveness of chemical structure clustering. The cumulative voting-based aggregation algorithm (CVAA) was the method of choice among consensus clustering methods.
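Graph-based consensus methods such as CSPA typically start from a co-association matrix that records how often two objects share a cluster across the ensemble. The sketch below shows only that first step, under the assumption that each clustering is given as a list of integer labels; the subsequent graph partitioning is omitted, and the example labels are invented.

```python
from typing import List, Sequence


def co_association(partitions: Sequence[Sequence[int]]) -> List[List[float]]:
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n_objects = len(partitions[0])
    matrix = [[0.0] * n_objects for _ in range(n_objects)]
    for labels in partitions:
        if len(labels) != n_objects:
            raise ValueError("all clusterings must label the same objects")
        for i in range(n_objects):
            for j in range(n_objects):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    n_partitions = len(partitions)
    return [[value / n_partitions for value in row] for row in matrix]


# Three hypothetical clusterings of four molecules.
ensemble = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
similarity = co_association(ensemble)
```

Label values need not agree between clusterings, since only co-membership within each clustering is counted.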
Exchange of chemical structures between practicing chemists is essential to chemical communication. The International Chemical Identifier (InChI) provides a means for lossless communication of structures without resort to any proprietary software or databases nor does it require any payment or royalty fees. This perspective describes why the InChI is valuable to all chemists and how it will be an essential component of creating the chemical web.
The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.
Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure–property relationships modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. While these are interchangeable for many cheminformatics purposes, there have been no studies on the inconsistency of these structure identifiers that arises from differing approaches to data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.
The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2%-98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).
We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence especially when merging data and if systematic identifiers are used as a key index for structure integration or cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.
Molecular structure; Chemical databases; Systematic chemical identifiers; Quality control; InChI; SMILES; IUPAC
In this work, we analyzed and compared the distribution profiles of a wide variety of molecular properties for three compound classes: drug-like compounds in MDL Drug Data Report (MDDR), non-drug-like compounds in Available Chemical Directory (ACD), and natural compounds in Traditional Chinese Medicine Compound Database (TCMCD).
The comparison of the property distributions suggests that, when all compounds in MDDR, ACD and TCMCD with molecular weight lower than 600 were used, MDDR and ACD are substantially different while TCMCD is much more similar to MDDR than ACD. However, when the three subsets of ACD, MDDR and TCMCD with similar molecular weight distributions were examined, the distribution profiles of the representative physicochemical properties for MDDR and ACD do not differ significantly anymore, suggesting that after the dependence of molecular weight is removed drug-like and non-drug-like molecules cannot be effectively distinguished by simple property-based filters; however, the distribution profiles of several physicochemical properties for TCMCD are obviously different from those for MDDR and ACD. Then, the performance of each molecular property on predicting drug-likeness was evaluated. No single molecular property shows good performance to discriminate between drug-like and non-drug-like molecules. Compared with the other descriptors, fractional negative accessible surface area (FASA-) performs the best. Finally, a PCA-based scheme was used to visually characterize the spatial distributions of the three classes of compounds with similar molecular weight distributions.
If FASA- was used as a drug-likeness filter, more than 80% of the molecules in TCMCD were predicted to be drug-like. Moreover, the principal component plots show that natural compounds in TCMCD have different and even more diverse distributions than either drug-like compounds in MDDR or non-drug-like compounds in ACD.
Drug-likeness; Traditional Chinese medicines; Principal component analysis (PCA); Property distribution; Molecular properties
The early drug discovery phase in pharmaceutical research and development marks the beginning of a long, complex and costly process of bringing a new molecular entity to market. As such, it plays a critical role in helping to maintain a robust downstream clinical development pipeline. Despite its importance, however, to our knowledge there are no published in silico models to simulate the progression of discrete virtual projects through a discovery milestone system.
Multiple variables were tested and their impact on productivity metrics examined. Simulations predict that there is an optimum number of scientists for a given drug discovery portfolio, beyond which output in the form of preclinical candidates per year will remain flat. The model further predicts that the frequency of compounds to successfully pass the candidate selection milestone as a function of time will be irregular, with projects entering preclinical development in clusters marked by periods of low apparent productivity.
The model may be useful as a tool to facilitate analysis of historical growth and achievement over time, help gauge current working group progress against future performance expectations, and provide the basis for dialogue regarding working group best practices and resource deployment strategies.
Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) methods, and produce highly interpretable models.
Associative classification mining; Fingerprint; Pipeline Pilot; Bayesian; SVM
To improve the utility of PubChem, a public repository containing biological activities of small molecules, the PubChem3D project adds computationally-derived three-dimensional (3-D) descriptions to the small-molecule records contained in the PubChem Compound database and provides various search and analysis tools that exploit 3-D molecular similarity. Therefore, the efficient use of PubChem3D resources requires an understanding of the statistical and biological meaning of computed 3-D molecular similarity scores between molecules.
The present study investigated effects of employing multiple conformers per compound upon the 3-D similarity scores between ten thousand randomly selected biologically-tested compounds (10-K set) and between non-inactive compounds in a given biological assay (156-K set). When the “best-conformer-pair” approach, in which a 3-D similarity score between two compounds is represented by the greatest similarity score among all possible conformer pairs arising from a compound pair, was employed with ten diverse conformers per compound, the average 3-D similarity scores for the 10-K set increased by 0.11, 0.09, 0.15, 0.16, 0.07, and 0.18 for STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt, respectively, relative to the corresponding averages computed using a single conformer per compound. Interestingly, the best-conformer-pair approach also increased the average 3-D similarity scores for the non-inactive–non-inactive (NN) pairs for a given assay, by comparable amounts to those for the random compound pairs, although some assays showed a pronounced increase in the per-assay NN-pair 3-D similarity scores, compared to the average increase for the random compound pairs.
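The best-conformer-pair approach amounts to taking a maximum over all conformer pairs. The sketch below is a generic illustration: `similarity` is a hypothetical callback standing in for any 3-D similarity score, and the toy example represents each conformer by a single number only to keep the code self-contained.

```python
from itertools import product
from typing import Callable, Sequence, TypeVar

Conformer = TypeVar("Conformer")


def best_conformer_pair_score(
    conformers_a: Sequence[Conformer],
    conformers_b: Sequence[Conformer],
    similarity: Callable[[Conformer, Conformer], float],
) -> float:
    """Greatest similarity over all conformer pairs of two compounds."""
    if not conformers_a or not conformers_b:
        raise ValueError("each compound needs at least one conformer")
    return max(similarity(a, b) for a, b in product(conformers_a, conformers_b))


def toy_similarity(a: float, b: float) -> float:
    """Hypothetical similarity on one-number 'conformers'."""
    return 1.0 / (1.0 + abs(a - b))


compound_a = [0.0, 2.0]
compound_b = [5.0, 2.5]
single_score = toy_similarity(compound_a[0], compound_b[0])
best_score = best_conformer_pair_score(compound_a, compound_b, toy_similarity)
```

Because the maximum is taken over a superset of the single-conformer pair, the best-pair score can never be lower than the single-conformer score, which is consistent with the uniform increases reported for random and non-inactive pairs alike.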
These results suggest that the use of ten diverse conformers per compound in PubChem bioassay data analysis using 3-D molecular similarity is not expected to increase the separation of non-inactive from random and inactive spaces “on average”, although some assays show a noticeable separation between the non-inactive and random spaces when multiple conformers are used for each compound. The present study is a critical next step to understand effects of conformational diversity of the molecules upon the 3-D molecular similarity and its application to biological activity data analysis in PubChem. The results of this study may be helpful to build search and analysis tools that exploit 3-D molecular similarity between compounds archived in PubChem and other molecular libraries in a more efficient way.
Ligand-based virtual screening using molecular shape is an important tool for researchers who wish to find novel chemical scaffolds in compound libraries. The Ultrafast Shape Recognition (USR) algorithm is capable of screening millions of compounds and is therefore suitable for use in a web service. The algorithm, however, is agnostic of atom types and cannot discriminate between compounds with similar shape but distinct pharmacophoric features. To solve this problem, an extension of USR called USRCAT has been developed that includes pharmacophoric information whilst retaining the performance benefits of the original method.
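For orientation, plain USR can be sketched in a few lines. The version below follows the commonly described recipe: distances from every atom to four reference points (centroid, atom closest to the centroid, atom farthest from the centroid, and atom farthest from that atom), three moments per distance distribution, and a similarity of 1 / (1 + mean absolute descriptor difference). The exact moment normalisation varies between implementations, so the choice here (mean, variance, cube root of the third central moment) is an assumption, and the atom-type extension of USRCAT is not shown.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def _moments(values: Sequence[float]) -> List[float]:
    """Mean, variance and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, variance, skew]


def usr_descriptor(coords: Sequence[Point]) -> List[float]:
    """Twelve shape moments from atomic coordinates only."""
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    d_centroid = [math.dist(c, centroid) for c in coords]
    closest = coords[d_centroid.index(min(d_centroid))]
    farthest = coords[d_centroid.index(max(d_centroid))]
    d_farthest = [math.dist(c, farthest) for c in coords]
    far_from_farthest = coords[d_farthest.index(max(d_farthest))]
    descriptor: List[float] = []
    for reference in (centroid, closest, farthest, far_from_farthest):
        descriptor.extend(_moments([math.dist(c, reference) for c in coords]))
    return descriptor


def usr_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in (0, 1]; identical descriptors give 1.0."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


# Toy coordinates (not a real molecule), a translated copy, and a stretched shape.
shape = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.5)]
shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in shape]
stretched = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.5)]
```

Because the descriptor is built from interatomic distances only, it needs no alignment step, which is what makes the comparison fast; it also explains why atom types are invisible to it.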
The USRCAT extension is shown to outperform the traditional USR method in a retrospective virtual screening benchmark. Also, a relational database implementation is described that is capable of screening a million conformers in milliseconds and allows the inclusion of complex query parameters.
USRCAT provides a solution to the lack of atom type information in the USR algorithm. Researchers, particularly those with only limited resources, who wish to use ligand-based virtual screening in order to discover new hits, will benefit the most. Online chemical databases that offer a shape-based similarity method might also find advantage in using USRCAT due to its accuracy and performance. The source code is freely available and can easily be modified to fit specific needs.
Virtual screening; Ultrafast shape recognition
Assigning bond orders is an essential step for characterizing a chemical structure correctly in force field based simulations. Several methods have been developed to do this. All have advantages, but also limitations. Here, an automatic algorithm for assigning chemical connectivity and bond order regardless of hydrogen for organic molecules is provided; only three-dimensional coordinates and element identities are needed. The algorithm uses hard rules, length rules and conjugation rules to fix the structures. The hard rules determine bond orders based on basic chemical rules; the length rules determine bond order from the distance between two atoms, based on a set of predefined values for different bond types; the conjugation rules determine bond orders by using the length information derived from the previous rule, the bond angles and some small structural patterns. The algorithm is extensively evaluated on three datasets and achieves good prediction accuracy for all of them. Finally, the limitations and future improvement of the algorithm are discussed.
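A length rule of the kind described can be sketched for carbon-carbon bonds. The reference lengths below are typical textbook values, not the predefined values used by the algorithm, and the nearest-reference decision is an illustrative simplification. Aromatic or conjugated bonds fall between the single and double values, which is where rules beyond a pure length rule become necessary.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]

# Typical textbook C-C bond lengths in angstroms (assumed, illustrative only).
REFERENCE_CC_LENGTHS = {1: 1.54, 2: 1.34, 3: 1.20}


def bond_length(a: Point, b: Point) -> float:
    """Euclidean distance between two atoms."""
    return math.dist(a, b)


def guess_cc_bond_order(length_angstrom: float) -> int:
    """Pick the bond order whose reference length is nearest to the input."""
    return min(
        REFERENCE_CC_LENGTHS,
        key=lambda order: abs(REFERENCE_CC_LENGTHS[order] - length_angstrom),
    )


# Two carbon atoms placed 1.34 angstroms apart along the x axis.
order = guess_cc_bond_order(bond_length((0.0, 0.0, 0.0), (1.34, 0.0, 0.0)))
```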
Bond type perception; Bond order; Chemical bond; Molecular modeling
HSQC spectra are routinely acquired for chemical structure analysis based on hydrogen and carbon chemical environments. Two fast HSQC peak matching algorithms have been developed: a nearest neighbour approach and a probabilistic method based on an existing discrete genetic algorithm. Both of these techniques are intended to find HSQC spectra matches that supplement information generated by established molecular fingerprint methods. Our results are compared to those calculated using a specific implementation of molecular fingerprints. The nearest neighbour and genetic algorithm-based methods gave high ranks to particular structures that were missed by molecular fingerprints. Our analysis shows that by complementing molecular fingerprint matches with our findings, a comprehensive list of matches can be identified. The refined list of compounds could be used to improve the quality of compounds used in screening libraries in the pharmaceutical industry.
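A nearest neighbour peak match can be sketched as follows. This is not the published scoring scheme: the distance measure, the scaling of the carbon axis by a factor of 10 (`carbon_scale`) and the averaging are all assumptions made for illustration, and the peak lists are invented. Each peak is a (1H ppm, 13C ppm) pair.

```python
import math
from typing import Dict, List, Sequence, Tuple

Peak = Tuple[float, float]  # (1H shift in ppm, 13C shift in ppm)


def peak_distance(p: Peak, q: Peak, carbon_scale: float = 10.0) -> float:
    """Distance between two peaks with the 13C axis scaled down (assumed factor)."""
    return math.hypot(p[0] - q[0], (p[1] - q[1]) / carbon_scale)


def nn_match_score(
    query: Sequence[Peak], candidate: Sequence[Peak], carbon_scale: float = 10.0
) -> float:
    """Mean distance from each query peak to its nearest candidate peak."""
    if not query or not candidate:
        raise ValueError("both spectra need at least one peak")
    total = sum(
        min(peak_distance(p, q, carbon_scale) for q in candidate) for p in query
    )
    return total / len(query)


def rank_candidates(query: Sequence[Peak], library: Dict[str, List[Peak]]) -> List[str]:
    """Library entries ordered from best (lowest score) to worst match."""
    return sorted(library, key=lambda name: nn_match_score(query, library[name]))


# Hypothetical query spectrum and two hypothetical library spectra.
query_peaks = [(7.2, 128.0), (3.6, 55.0)]
spectra = {
    "far": [(1.2, 20.0), (2.1, 30.0)],
    "close": [(7.25, 128.5), (3.55, 54.0)],
}
ranking = rank_candidates(query_peaks, spectra)
```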
HSQC; Spectral matching; Similarity; Nearest neighbours; Discrete genetic algorithm
Displaying chemical structures in LaTeX documents currently requires either hand-coding of the structures using one of several LaTeX packages, or the inclusion of finished graphics files produced with an external drawing program. There is currently no software tool available to render the large number of structures available in molfile or SMILES format to LaTeX source code. We here present mol2chemfig, a Python program that provides this capability. Its output is written in the syntax defined by the chemfig TeX package, which allows for the flexible and concise description of chemical structures and reaction mechanisms. The program is freely available both through a web interface and for local installation on the user’s computer. The code and accompanying documentation can be found at http://chimpsky.uwaterloo.ca/mol2chemfig.
LaTeX; Chemfig; Molfile; SMILES; Molecular structures; Code generation
A variety of software packages are available for the combinatorial enumeration of virtual libraries for small molecules, starting from specifications of core scaffolds with attachments points and lists of R-groups as SMILES or SD files. Although SD files include atomic coordinates for core scaffolds and R-groups, it is not possible to control 2-dimensional (2D) layout of the enumerated structures generated for virtual compound libraries because different packages generate different 2D representations for the same structure. We have developed a software package called LipidMapsTools for the template-based combinatorial enumeration of virtual compound libraries for lipids. Virtual libraries are enumerated for the specified lipid abbreviations using matching lists of pre-defined templates and chain abbreviations, instead of core scaffolds and lists of R-groups provided by the user. 2D structures of the enumerated lipids are drawn in a specific and consistent fashion adhering to the framework for representing lipid structures proposed by the LIPID MAPS consortium. LipidMapsTools is lightweight, relatively fast and contains no external dependencies. It is an open source package and freely available under the terms of the modified BSD license.
There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.
I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database, and a 1 million compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database and 99.77% of the PubChem subset were canonicalised successfully.
The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain – such as the development of a standard aromatic model for SMILES – the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
Line notations; InChI; SMILES; Canonicalisation