The performance of the NP-likeness score depends, of course, on the choice of natural products and synthetic molecules in the training dataset. For the analysis of our engine’s performance, the natural products, synthetic molecules and query compound collections were all obtained exclusively from open access databases. Our first subset of natural products (22,876 molecules) originates from the ChEMBL database [13], from which we selected molecules extracted from the Journal of Natural Products. The second subset of natural products (39,162 molecules) comes from the Traditional Chinese Medicine Database @ Taiwan (TCM) [14]. Together, the natural product training set comprised 58,018 non-redundant structures. The training set of synthetic molecules comprised 113,425 clean lead-like compounds selected from the ZINC database [15]. Small molecules from DrugBank [16] and the Human Metabolome Database (HMDB) [17] were treated as our test sets. In addition, PubMed abstracts reporting the isolation of new NPs were text-mined for natural product names, the names were converted into SMILES using the Chemical Identifier Resolver [18], and the resulting set of 3610 non-redundant NPs was used as a further test set.
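The two natural product subsets sum to 22,876 + 39,162 = 62,038 entries, which is 4,020 more than the 58,018 non-redundant structures reported. How redundancy was determined is not stated here. The following minimal sketch shows one plausible way to merge collections, assuming each structure has already been reduced to a canonical key (for example a canonical SMILES or InChIKey). The function name and the toy data are hypothetical.

```python
def merge_nonredundant(*collections):
    """Union several collections of canonical structure keys, dropping duplicates."""
    merged = set()
    for keys in collections:
        merged.update(keys)
    return merged


# Toy keys standing in for canonical identifiers (hypothetical data,
# not taken from ChEMBL or TCM).
chembl_subset = ["CCO", "c1ccccc1", "CC(=O)O"]
tcm_subset = ["CC(=O)O", "CCN", "CCO"]

training_set = merge_nonredundant(chembl_subset, tcm_subset)

# Number of entries dropped as redundant across both subsets.
removed = len(chembl_subset) + len(tcm_subset) - len(training_set)
```

Note that plain string comparison only removes duplicates if both sources were canonicalised in the same way by the same toolkit.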
The steps shown in Figure were repeated for both training and test sets to calculate their atom signatures. To score the test sets for NP-likeness, the steps shown in Figure were followed. The overall scores obtained in our test study ranged from -3 to +3. The more positive the score, the higher the NP-likeness, and vice versa. The distribution of scores obtained for the compounds in the test sets is shown in Figure 3. The distribution of the DrugBank compound set overlaps both the synthetic molecule and the natural product structural space. This is expected because, in drug design experiments, drug-like compounds often end up mimicking structural features of metabolites after the optimisation process [19]. Only one third of the natural product space that we captured overlaps with currently available common drugs. The text-mined natural products, as expected, almost completely overlap the structural space of the training natural products and occupy only a small additional region.
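To make the meaning of positive and negative scores concrete, the sketch below assumes the log-ratio fragment weighting described by Ertl et al., in which a fragment i contributes log((NP_i / SM_i) · (SM_t / NP_t)) and the molecule score is the sum of contributions divided by the number of atoms. This section does not restate the formula, so the code is an illustration under that assumption rather than the implementation itself. The pseudocount, the use of the natural logarithm, the function names and the toy signatures are all hypothetical choices.

```python
import math


def fragment_weights(np_counts, sm_counts, np_total, sm_total, pseudocount=1.0):
    """Assumed log-ratio weight per fragment: log((NP_i / SM_i) * (SM_t / NP_t)).

    np_counts / sm_counts map a fragment (atom signature) to the number of
    training molecules containing it; np_total / sm_total are training set
    sizes. The pseudocount (an assumption) avoids division by zero for
    fragments seen in only one of the two sets.
    """
    weights = {}
    for frag in set(np_counts) | set(sm_counts):
        np_i = np_counts.get(frag, 0) + pseudocount
        sm_i = sm_counts.get(frag, 0) + pseudocount
        weights[frag] = math.log((np_i / sm_i) * (sm_total / np_total))
    return weights


def np_likeness(signatures, weights):
    """Sum the weights of a molecule's atom signatures (one per atom),
    normalised by the number of atoms. Unknown signatures contribute 0."""
    if not signatures:
        return 0.0
    return sum(weights.get(s, 0.0) for s in signatures) / len(signatures)


# Toy training statistics (hypothetical): sigA is NP-enriched, sigB is not.
weights = fragment_weights({"sigA": 90, "sigB": 10},
                           {"sigA": 10, "sigB": 90},
                           np_total=100, sm_total=100)

np_like_score = np_likeness(["sigA", "sigA", "sigB"], weights)
synthetic_like_score = np_likeness(["sigB", "sigB", "sigA"], weights)
```

Under these assumptions, a molecule dominated by signatures that are more frequent in the natural product set receives a positive score, and one dominated by signatures typical of synthetic molecules receives a negative score.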
Figure 3. Distribution of NP-likeness score for the training (synthetic molecules and natural products) and the test datasets. The synthetic molecules are a subset of the clean lead-like collection from the ZINC database and the natural products are small molecules ...
To validate our scoring system, the 3610 text-mined NPs together with an additional 5000 synthetic molecules were scored using both our system and the original implementation by Ertl et al. Despite the much larger training set of the original system, the two sets of scores correlated well (r = 0.94). Furthermore, when the training data in the original system were replaced with our open data, the scores obtained for the test set correlated very well (r = 0.97). Taking into account that the two cheminformatics toolkits used to calculate the values differ slightly in their handling of aromaticity, tautomerism, molecule normalisation, etc., and that the types of substructure fragments also differ slightly, we consider this agreement very good and a full validation of the new implementation of NP-likeness.
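The r-values above compare two lists of scores computed for the same molecules. Assuming they are Pearson correlation coefficients (the text does not name the statistic), the comparison can be sketched as follows. The function name and the example scores are hypothetical and do not reproduce the reported data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same five molecules from two implementations.
scores_new = [-2.1, -0.8, 0.1, 1.2, 2.4]
scores_original = [-1.9, -1.0, 0.3, 1.0, 2.6]

r = pearson_r(scores_new, scores_original)
```

Because Pearson's r is insensitive to a constant offset or scale factor between the two score lists, a high value indicates that the two implementations rank and space molecules similarly, not that their absolute scores are identical.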