A total of 2,322 PubChem bioassays were used in this study. Assay descriptions are unstructured texts that usually provide background information about the goal of the screening (such as the biological system used in the research, the significance of finding molecular modulators and their relevance to disease treatment, and experimental protocols). The descriptions were extracted from the PubChem BioAssay database and subjected to the BOW analysis.
Cosine score distribution and clustering analysis
The distribution of the cosine scores over all pairs of bioassay descriptions is provided in Figure . Most cosine scores fall into two well-separated regions. About 95% of assay pairs have a cosine value between 0 and 0.5. A second, relatively small region lies between cosine values of 0.8 and 1. As can be seen from the distribution plot (Figure ), a clear border between these two regions can be drawn between cosine values of 0.4 and 0.8. Since a lower cosine value from the text-mining analysis suggests weaker relevance between the two assays, a cosine score in the range of 0 ~ 0.4 probably indicates a random relationship or weak relevance. On the other hand, a cosine value ≥ 0.8 likely gives a strong indication of high relevance between the assays compared. Nearly 0.5% of all possible assay pairs fall in this region, which suggests interesting relationships among these bioassays. The region with a cosine score in the range of 0.4 ~ 0.8 contains a small fraction (less than 1%) of assay pairs.
Cosine score distribution over bioassays.
To examine the assay relevance measured by the cosine score, a manual verification and spot-checking process was conducted by curating the details of the assay descriptions. The overall verification of the assay pairs reveals that the majority of the identified assay neighbors with higher cosine scores are closely related, and suggests that the cosine score serves as a strong indicator of conceptual relevance among bioassays. In particular, our result shows that assay pairs with a cosine score of 0.90 or higher are directly related. For example, Bioassay AID 777 and AID 778 are identified as highly related (cosine score 0.94) even though they studied cytochrome P450 enzymes with different metabolic functions. The goal and protocol of these two assays were quite similar: both tested the ability of the compounds to inhibit members of the P450 enzyme family in the conversion of the substrate luciferin-H EGE to luciferin EGE. By recognizing such related assays, one would be able to combine the results from the counter screenings to evaluate the inhibition specificity of the compounds towards different members of the cytochrome P450 families. Assay pairs with a cosine score of 0.8 ~ 0.90 can also be highly related.
When looking into the twilight zone with a cosine score range of 0.5 ~ 0.8, it was noticed that assay pairs with a cosine score greater than 0.6 can sometimes also be of interest for revealing biological relevance. For the set of cellular toxicity HTS assays that the NIH Chemical Genomics Center (NCGC) developed against a number of cell lines, the text-mining based method was able to group a large portion of these assays together. For example, AID 658, AID 659, AID 661 and AID 657 were clustered together as they were all designed for measuring compound cellular toxicity in human cell lines. Meanwhile, AID 433, 543 and 540, which were designed to determine in vitro cytotoxicity, were also clustered together. The ability to cluster such toxicity assays could be useful for constructing an assay panel to systematically analyze the toxicity profiles of compounds across multiple cell lines or organisms.
Comparison of the text-mining based neighbors against the human-curated set
The human-curated bioassay neighbor set refers to the related bioassays annotated by depositors. Depositor-specified annotations were subjected to the examination of the PubChem curators during the bioassay deposition process. Depositor-specified related bioassays address various aspects of assay relationships, such as linking primary, confirmatory and counter screenings of the same assay project. Although the perspective of the depositors may vary, such annotations on bioassay relationships provide a benchmark for evaluating the recognition of bioassay relationships by the text-mining algorithm. The selection of cosine score threshold is critical for identifying significant relationships among the compared assays. There is a trade-off between the precision and the recall for optimizing the threshold. Therefore, it is essential to compare the performance of identifying related assay pairs at a series of cosine score thresholds. The result of this analysis using the depositor provided bioassay relationship as a benchmark is summarized in Table , where "precision" was defined as the ratio of true predictions over the complete predictions, whereas "recall" was defined as the ratio of true predictions over the depositor defined neighboring pairs. "True prediction" was defined as the overlap between the depositor defined neighboring pairs and those predicted by the text-mining method.
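The precision and recall definitions above can be illustrated with a minimal sketch. The assay identifiers, cosine scores and benchmark pairs below are made-up toy values, and the function names (`precision_recall`, `sweep_thresholds`) are hypothetical; this is not the code used in the study.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[int]  # an unordered pair of assay identifiers


def precision_recall(predicted: Set[Pair], benchmark: Set[Pair]) -> Tuple[float, float]:
    """Precision = true predictions / all predictions;
    recall = true predictions / depositor-defined pairs."""
    true_predictions = predicted & benchmark
    precision = len(true_predictions) / len(predicted) if predicted else 0.0
    recall = len(true_predictions) / len(benchmark) if benchmark else 0.0
    return precision, recall


def sweep_thresholds(
    scores: Dict[Pair, float], benchmark: Set[Pair], thresholds: Iterable[float]
) -> Dict[float, Tuple[float, float]]:
    """Evaluate precision and recall at each cosine score threshold."""
    results = {}
    for t in thresholds:
        # A pair is predicted as related when its cosine score reaches the threshold.
        predicted = {pair for pair, score in scores.items() if score >= t}
        results[t] = precision_recall(predicted, benchmark)
    return results


# Toy data (hypothetical assay IDs and scores, for illustration only).
scores = {
    frozenset({1, 2}): 0.95,
    frozenset({1, 3}): 0.60,
    frozenset({2, 3}): 0.45,
    frozenset({3, 4}): 0.10,
}
benchmark = {frozenset({1, 2}), frozenset({2, 3}), frozenset({4, 5})}

table = sweep_thresholds(scores, benchmark, [0.4, 0.9])
for threshold, (p, r) in table.items():
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

In this toy example, raising the threshold from 0.4 to 0.9 increases precision while lowering recall, which is the trade-off described above.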
A summary of precision and recall under various cosine score thresholds, comparing the text-mining based neighbors with the depositor-specified neighbors.
This analysis suggests that a cosine score of 0.4 can serve as a reasonable cut-off to balance precision and recall, which agrees with the earlier analysis of the cosine score distribution. It is noted that the precision at a cosine score threshold of 0.9 is low. This is due to the limitations of the human-curated assay neighboring set, whose coverage is low or incomplete. This is especially the case for the assays contributed by the NIH Molecular Libraries Program (MLP), where reports for a specific assay project are often split into many bioassay records, mostly because an assay project including follow-up experiments may take a few months to a couple of years to complete. Data produced at each stage of the experiments are required to be deposited in a timely manner into the central PubChem repository. Depositors also sometimes deposit test results from different compound libraries or from counter screenings under separate records. Thus, tracking the deposited records and providing a comprehensive linkage annotation of the overall assay relationship are burdensome for depositors, which is one reason for the lack of complete bioassay linkage annotation from depositors in PubChem.
A significant number of assay relationships can be confirmed by spot checking the assay pairs identified by the text-mining approach. Although these assay pairs were not specified as related at deposition time, about 99% of the assay pairs identified at this threshold were deposited by the same assay providers. This suggests that the text-mining based method complements human annotations significantly when depositors provide only limited bioassay relationships.
In many cases, our analysis also suggests that the neighboring relationships derived from text mining correlate well with intrinsic relationships among bioassays. Moreover, this approach is especially effective where other assay-clustering methods cannot be applied. Bioassays AID 454, 455, 456, and 457 are related: all screen compounds for enhancing or attenuating TNFa induced VCAM-1 cell surface expression, with AID 457 (imaging assay) and AID 455 (plate reader assay) reporting compounds with an augmentation effect, and AID 456 (imaging assay) and AID 454 (plate reader assay) reporting compounds with an inhibition effect. Identifying the relationships among these assays would allow one to collect effective chemical reagents for the studied biological process. Unfortunately, this relationship was not annotated by the assay depositors, and none of the other three automated assay neighboring approaches could detect it due to the lack of target specification or common hits. With the aid of the text-mining based approach, however, the biologically important relevance among this group of assays was successfully identified. As another example, primary assays (AID 738, 739, 636, 637) searching for modulators of post-Golgi transport were first clustered together at a cosine score cut-off of 0.90, then further connected with the related dose response assays (AID 788, 789, 790) at a cosine score cut-off of 0.88. This hierarchical clustering result reflects the biological relationships among the assays at three levels: the purpose of the assays, the experiment and the project stage. None of these bioassays has protein target information, and they have very few active compounds in common. Thus it is very difficult for the existing automated neighboring methods to discover their relevance.
The text-mining approach compares each bioassay in PubChem against all other bioassays irrespective of the data source. While this method is most effective at detecting relationships among assays from the same depositor, it was observed that nearly 1% of the related assay pairs identified are from different depositors. One such example is the pair AID 465 and AID 819. These two assays came from two data sources but were recognized as related by the text-mining based method. Both assays were set up to identify chemicals modulating NFkB activities. Neither of them has targets defined, which again makes the existing target based neighboring method inapplicable to them.
Comparison among assay-neighboring analysis approaches
There are different interests and perspectives when the enormous collection of HTS data is interpreted. Currently, there are four approaches in PubChem for neighboring bioassays, each providing different insights into bioassay relationships. Three of them are automated: one uses common biological pathways, one finds sequence homology among protein targets, and one calculates chemical structure identity among hit compounds. The fourth uses the annotations from bioassay depositors.
For evaluating the new text-mining based approach, comparisons of the four automated neighboring procedures were performed and summarized below using the human annotations as a baseline.
Target similarity based bioassay neighboring
Target similarity based bioassay neighboring analysis enables one to identify assays tested against biologically related molecular targets, facilitating the construction of an assay panel for compound selectivity and specificity study. The relevance of bioassays is evaluated by the sequence similarities of their protein targets. This approach is both simple and effective in clustering bioassays. It enables the straightforward retrieval and comparison of the sequences of the assay targets. The BLASTP [38] algorithm was employed to identify the homology between bioassay targets. On the other hand, the target similarity based neighboring analysis is only applicable to bioassays for which protein targets are explicitly defined. As about 40% of the bioassays in PubChem do not contain a protein target, the target-based neighboring approach will not work for these bioassays.
Activity overlap based bioassay neighboring
An individual HTS assay for small molecules usually measures certain bioactivity properties and describes the bioactivity outcome for the tested compounds in a specific biological system. In order to decide whether a follow-up study is worthwhile, a compound may be tested in multiple HTS screenings. Grouping assays that share common active compounds would facilitate a comparison across multiple assays. In addition, a common group of compounds that perform similarly in different assays can be a very interesting indicator of an underlying relationship between the biological systems used or the biological processes monitored in the assays. Therefore, the assay relationships identified by checking activity overlap or common hits could be of interest for generating new hypotheses. However, this approach is sensitive to the selection of the compound libraries tested in the assays, and may not be applicable to every assay. Since each assay may test a specific compound library, the overlap among the compound libraries is the first determining factor for this approach when neighboring assays. In addition, this method is prone to experimental noise from HTS screenings.
Common biosystem based bioassay neighboring
In the biosystem based assay neighboring method, common biological pathways of the respective protein or gene targets are examined. Bioassays are considered related if their protein or gene targets participate in the same biological pathways according to the National Center for Biotechnology Information BioSystems database [39]. This type of relationship allows one to aggregate assay results and to identify the compounds affecting a common pathway. Similarly to the target homology based assay neighboring method, this approach relies on unambiguous annotations for the assay targets or the molecular pathways studied.
Text-mining based bioassay neighboring
Unlike the other bioassay neighboring methods discussed above, the text-mining based approach does not rely on the availability of specific annotations, but utilizes the free text descriptions. Since there is no specific domain knowledge defined prior to the text-mining, the relevance of bioassays depends on the concept of descriptions. Here the underlying concept of descriptions could be the accumulation of multiple meaningful terms (such as the description about a biological process, name of the protein or gene involved, HTS screening protocol, activity type and assay readout, or methods for activity measurement).
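As an illustration of how free-text descriptions can be compared, the following minimal sketch computes a cosine score between bag-of-words vectors. It assumes raw term counts and a simple lowercase alphanumeric tokenizer; the term weighting, stop-word handling and preprocessing actually used in this work may differ, and the example descriptions are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_score(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented descriptions, for illustration only.
d1 = bag_of_words("Inhibition of cytochrome P450 enzyme activity")
d2 = bag_of_words("Inhibition of a cytochrome P450 enzyme isoform")
d3 = bag_of_words("Cell viability readout")

same = cosine_score(d1, d1)       # identical descriptions
partial = cosine_score(d1, d2)    # overlapping vocabulary
disjoint = cosine_score(d1, d3)   # no shared terms
print(same, partial, disjoint)
```

Descriptions that share many terms score close to 1, while descriptions with no terms in common score 0.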
Analysis of result comparison
One of the advantages of the text-mining based neighboring analysis is that it can discover relevance that other automated approaches cannot. To provide further evaluation of the text-mining based approach, the assay neighbors identified by this method were compared to those annotated by the bioassay depositors and those suggested by the other three automated methods. Precision and recall values for each automated method were computed using the depositor-provided neighbors, also known as human-curated assay neighbors, as the benchmark. This dataset is contributed by independent bioassay submitters and represents expert opinions on the pairwise bioassay relationships. As this benchmark set does not depend on any particular data elements required by the automated methods, it provides a way to examine to what extent the automated neighboring methods agree with the human-curated dataset and with each other. Annotations of the depositor-provided related bioassays are stored in each bioassay record (query assay) in the PubChem database where applicable. In each such annotation, one assay can be denoted as related to one or more assays (neighbor assays) through the cross-reference data field, resulting in one or more assay pairs per annotation. To construct the benchmark dataset, related assay pairs were extracted from a total of 1747 bioassay records for which depositor-provided annotations are available. These assay pairs were further grouped into 1306 clusters using the unsupervised single linkage clustering procedure. The final list of assay pairs was derived by considering all possible pairwise combinations of the assays deemed related within each cluster, and the total number of assay pairs was derived accordingly. As a result, the benchmark dataset contained 8802 bioassay pairs, with 41% of the assay pairs containing no target information. The median cluster size was 4, and 216 clusters contained a single assay pair.
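The benchmark construction can be sketched as follows. This is a minimal illustration that assumes single linkage clustering over binary "related" annotations is equivalent to finding connected components, implemented here with a union-find structure; the assay identifiers are toy values and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def single_linkage_clusters(pairs: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Group assays so that any two assays linked by a chain of annotated
    pairs end up in the same cluster (connected components)."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[int, Set[int]] = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


def all_pairs(clusters: Iterable[Set[int]]) -> Set[frozenset]:
    """Expand each cluster into all possible pairwise combinations."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


# Toy annotations (hypothetical assay IDs): 1-2 and 2-3 are annotated, 1-3 is not.
annotated = [(1, 2), (2, 3), (10, 11)]
clusters = single_linkage_clusters(annotated)
benchmark_pairs = all_pairs(clusters)
print(sorted(sorted(c) for c in clusters))
print(len(benchmark_pairs))
```

Note that the pair (1, 3) enters the benchmark even though it was never annotated directly, because both assays are linked through assay 2.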
The F1 score, the harmonic mean of the precision and recall values, was provided for clearer comparison of the overall performance of the methods compared to human curated datasets. In this analysis, the cosine similarity threshold for the text-based method was 0.4. The results are summarized in Table . Among all four automated methods, the text-mining based approach apparently has the best recall and precision compared to the depositor-specified neighbors.
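For reference, the harmonic mean used for the F1 score can be written as a short function. The input values below are arbitrary illustrations, not results from this study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Arbitrary illustrative values: a low precision pulls F1 down
# even when recall is high.
balanced = f1_score(0.5, 0.5)
skewed = f1_score(0.2, 0.8)
print(balanced, skewed)
```

Because the harmonic mean is dominated by the smaller of the two values, a method with high recall but low precision still receives a low F1 score.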
Comparison of the four automated bioassay neighboring methods by using the depositor-defined (human-curated) assay neighbors as a benchmark.
It can be observed from Table that the three existing automated methods perform similarly well. The recall values for those three methods, which range between 34% and 46%, are reasonable and understandable given the intrinsic limitations of these methods, as discussed previously, and the intrinsic nature of the dataset. About 41% of the neighbor pairs in the benchmark dataset involve cell or organism based assays and contain no target. Thus, the target and biosystems based methods are not able to detect this considerable portion of the bioassay relationships. On the other hand, the text-based method is not bound to any particular structured data, and is thus able to recognize relationships even among the cell or organism based bioassays. The significantly higher recall value (86%) indicates that this approach complements the existing methods remarkably.
The low precision, which led to low F1 values, was expected for all four methods. As discussed earlier for the results shown in Table , this was largely due to the low coverage of the depositor-provided relationships. While the human-curated dataset is highly reliable for deriving bioassay relationships, it has been observed that its coverage is rather limited and a large portion of true and meaningful relationships are not captured, which motivated the development of alternative approaches for detecting bioassay relationships, including the text-mining based approach in this work. As discussed previously, the limited coverage is partly due to the fact that bioassay submissions from the same or related projects can be made over an extended period of time, so it is troublesome for depositors to track the submissions. As a result, assay depositors sometimes neglect to provide comprehensive annotations of the bioassay relationships even for assays from the same project. Secondly, individual bioassay depositors from different laboratories may work on assays against the same or biologically related targets. Since assays from different data sources are submitted to PubChem independently, the relationship among such assays is typically not recognized by the depositors. This explains why the automated methods, particularly the target and biosystems based methods, detect many times more meaningful bioassay relationships than the depositors annotated, which in turn led to low precision. We considered the related assays from the target and biosystems based methods to reflect true biological relationships, as they resulted from conservative analysis based on the biological sequences of the targets involved in the assays; i.e., they were true predictions that were not annotated by the depositors.
When looking into the overlap of the predictions among different methods, 62.5% of the assay pairs from the text-mining approach were further confirmed by at least one of the other automated procedures, indicating that these bioassays can be related in one way or another. Furthermore, the majority of the novel pairs detected by the text-based method involve assays containing no target specifications, which again demonstrates that the text-based approach may play a critical role in detecting bioassay relationships that are otherwise impossible for the existing methods to recognize.