In this paper we described CoPub Discovery, a web-based tool that mines the Medline database for novel relationships between genes, diseases, drugs and pathways. The results show that using hidden relationships, we can successfully identify novel disease-related genes, generate novel hypotheses on drug mode of action and predict novel lead compound applications.
Drug discovery is a difficult and time-consuming process. Despite the strong increase in research and development funding over the last decade, the number of drugs that reach the market each year is lagging behind. Several strategies have been adopted to bridge this gap. The use of systems biology to gain better knowledge of the mechanisms of drug action and toxicity, and the use of biomarkers that are predictive of a certain biological outcome, are widely used approaches to improve decision making. In addition, drug repositioning, the use of existing drugs for new applications, is gaining much attention as a means to boost drug development. Several text mining solutions have been developed to assist in and speed up these strategies.
In a recent paper, Campillos et al. showed how text mining of drug labels can be used to infer whether two drugs share the same target. Our study identified several novel targets for known drugs, based on a different algorithm and another text corpus. This indicates that literature mining is a fruitful approach to identifying new drug-target relations, a first step in developing drugs towards new applications.
Detailed knowledge of a drug's mechanism of action and of the biological processes it targets is important for fine-tuning drugs and for biomarker discovery. In an earlier study, we showed that applying text mining to expression data from a toxicogenomics experiment yielded detailed insight into the mode of toxicity of the tested compounds. With the hidden relationship algorithm presented in this paper, we provide a text mining tool that is independent of gene expression data and that improves the understanding of a drug's mechanism of action and the pathways targeted by that drug.
Although CoPub Discovery is successful in identifying novel, biologically relevant relationships in literature, several improvements can be envisioned. For example, incorporating additional evidence for true relationships between concepts from sources other than literature, such as protein-protein interaction data or gene co-expression data, could help prioritize relationships by biological relevance. Furthermore, an additional measure of confidence could come from analyzing the relationships between the intermediates that connect A and C. A highly interconnected set of intermediates could indicate higher biological relevance than a set with few interconnections.
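As an illustration of the proposed interconnection measure, the sketch below computes the fraction of intermediate pairs that are themselves linked. This is only one possible measure (graph density) and is an assumption on our part, not a feature of CoPub Discovery; the function name `intermediate_density` and the example concepts and links are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Tuple


def intermediate_density(intermediates: Iterable[str],
                         links: Iterable[Tuple[str, str]]) -> float:
    """Fraction of intermediate pairs that are directly linked (0.0 to 1.0).

    `links` holds undirected relationships between intermediates, for
    example pairs that co-occur in the literature (an assumption here).
    """
    nodes = sorted(set(intermediates))
    if len(nodes) < 2:
        return 0.0  # density is undefined for fewer than two nodes
    # Store links as unordered pairs so (x, y) and (y, x) are equivalent.
    link_set = {frozenset(pair) for pair in links}
    pairs = list(combinations(nodes, 2))
    connected = sum(1 for pair in pairs if frozenset(pair) in link_set)
    return connected / len(pairs)


# Hypothetical intermediates connecting A and C, and links among them.
example_intermediates = ["IL6", "TNF", "NFKB1", "apoptosis"]
example_links = [("IL6", "TNF"), ("TNF", "NFKB1"), ("NFKB1", "IL6")]
density = intermediate_density(example_intermediates, example_links)
print(f"Density of the intermediate set: {density:.2f}")
```

In this toy example three of the six possible pairs are linked. A higher value would, under the assumption above, lend more confidence to the A-C hypothesis.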
Co-occurrence-based text mining does not capture the type of the extracted relationships (e.g. A binds, blocks or induces B). Therefore, in the CoPub Discovery web server the results are linked to the original abstracts in which the relationships were found. This enables the scientist to read the underlying statements and determine the type of relationship between A and C. A good starting point for discovery is to look for intermediate nodes (B) that have the highest R-scaled scores with both node A and node C, because they provide the strongest link between A and C. After selecting a few of these nodes, the researcher can analyze the functional association between A and C in detail by reading the abstracts in which A and B, and B and C, are mentioned. Additionally, incorporating natural language processing in hidden relationship analysis could assist in determining the type and direction of the relationship between A and C.
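The selection of intermediates described above can be sketched as follows. This is a minimal illustration, not the CoPub Discovery implementation: ranking each intermediate by the weaker of its two R-scaled scores is our assumption for "highest for both A and C", and the function name, concept names and score values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_intermediates(score_ab: Dict[str, float],
                       score_bc: Dict[str, float],
                       top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank intermediates B that are linked to both A and C.

    Each B is scored by the weaker of its two links (assumed rule), so
    only intermediates that score well on both sides rank highly.
    """
    shared = score_ab.keys() & score_bc.keys()
    ranked = sorted(
        ((b, min(score_ab[b], score_bc[b])) for b in shared),
        key=lambda item: (-item[1], item[0]),  # best score first, then name
    )
    return ranked[:top_n]


# Hypothetical R-scaled scores for A-B and B-C relationships.
a_to_b = {"IL6": 62.0, "TNF": 55.0, "apoptosis": 40.0, "VEGFA": 30.0}
b_to_c = {"IL6": 48.0, "TNF": 58.0, "apoptosis": 45.0, "MYC": 70.0}

top = rank_intermediates(a_to_b, b_to_c, top_n=2)
for name, score in top:
    print(name, score)
```

Intermediates linked to only one of A or C (here VEGFA and MYC) are excluded, since they cannot connect the two concepts. The abstracts behind the top-ranked intermediates are then the natural place to start reading.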
In the validation of CoPub Discovery using ROC curve analysis, we define FPs as A–C relationships that are predicted from the literature before the year 2000 but are not detected in subsequent literature. It may well be true that such an FP is in fact a novel discovery that has not yet been reported in subsequent literature. Furthermore, one can argue that a high area under the curve (AUC) score indicates that CoPub Discovery discovers very little that would not eventually have been discovered without it. In this respect, the 6.5 year time lag between the CoPub Discovery prediction and the report in literature may be more indicative of the true value of CoPub Discovery: it significantly speeds up hypothesis generation, filtering and testing, as was demonstrated in case example 4, in which we followed exactly this approach.
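To make the retrospective evaluation concrete, the sketch below computes an AUC from scored predictions and their later confirmation status. It is a minimal illustration with hypothetical scores and labels, not the validation code used in this study. A label is True when a relationship predicted from the pre-2000 literature appears in subsequent literature, and False when it does not (an FP as defined above). The AUC is computed in its pairwise form: the probability that a confirmed prediction scores higher than an unconfirmed one.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], confirmed: Sequence[bool]) -> float:
    """Area under the ROC curve, computed by pairwise comparison.

    Counts how often a confirmed prediction outscores an unconfirmed
    one; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, confirmed) if y]
    neg = [s for s, y in zip(scores, confirmed) if not y]
    if not pos or not neg:
        raise ValueError("need at least one confirmed and one unconfirmed prediction")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical prediction scores from the pre-2000 corpus, and whether each
# A-C pair was found co-occurring in later literature.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
example_confirmed = [True, True, False, True, False, False]

auc = roc_auc(example_scores, example_confirmed)
print(f"AUC = {auc:.3f}")
```

Because unconfirmed predictions may simply not have been reported yet, an AUC obtained this way is a conservative estimate of predictive performance.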
Comparing the ROC curves with the performance of other text mining tools is hampered by the fact that not all tools are accessible, and those that are often work on different text corpora or use different thesauri. Development of tools for the discovery of hidden relationships would benefit from expert-curated test and training sets on well-defined literature corpora, as is done in the BioCreative text mining challenges.
The statistical underpinning of CoPub Discovery provides a significant advantage over existing text mining tools applied in the area of drug development. It allows confidence levels to be calculated for hidden relationships and helps discriminate biologically relevant hidden relationships from biologically less interesting ones. To ensure the quality of the hidden relationships, several constraints were placed on the biomedical concepts used in CoPub Discovery. For example, the biomedical concepts used in literature mining were all pre-tested for false positive generation upon inclusion in one of the biomedical concept thesauri. Furthermore, only genes and biological processes are allowed as intermediates, which avoids relationships being formed through non-informative concepts such as 'protein', 'cell assay', etc.
In short, the results in this paper show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. The fact that this is done rapidly and in an automated way makes the tool especially useful in areas where large amounts of data need to be analyzed. A typical use of this tool could be to quickly rank potential new biomarkers obtained from, for example, a microarray experiment, based on their relation to diseases and drugs. CoPub Discovery could also help in drug repositioning, in which lists of drugs are clustered and ranked on the basis of their relation to diseases and biological processes of interest.