The availability of scientific bibliographies through online databases provides a rich source of information for scientists to support their research. However, the risk of this pervasive availability is that an individual researcher may fail to find relevant information that is outside the direct scope of interest. Following Swanson’s ABC model of disjoint but complementary structures in the biomedical literature, we have developed a discovery support tool to systematically analyze the scientific literature in order to generate novel and plausible hypotheses. In this case report, we employ the system to find potentially new target diseases for the drug thalidomide. We find solid bibliographic evidence suggesting that thalidomide might be useful for treating acute pancreatitis, chronic hepatitis C, Helicobacter pylori-induced gastritis, and myasthenia gravis. However, experimental and clinical evaluation is needed to validate these hypotheses and to assess the trade-off between therapeutic benefits and toxicities.
Researchers practice science in highly specialized intellectual environments and communities. It takes years of training to contribute new discoveries to the continually growing body of scientific knowledge. The staggering amount of available online knowledge and information, concerning both the personal field of expertise and developments in other disciplines, may overwhelm the individual researcher. Moreover, relevant information from those other disciplines may be overlooked. Swanson has phrased this particular situation as one of “complementary but disjoint structures within the literature of science.”1 Assume that one set of literature discusses the argument that A relates to B, while a separate literature provides a discussion on how B relates to C. These sets may have no articles in common and are therefore disjoint, yet the arguments are complementary because, when they are combined, there is the inference that A relates to C. Using titles from the bibliographic database MEDLINE, Swanson has made several discoveries in biomedicine by connecting disconnected sets of literature according to this simple ABC model.2–4 Importantly, the first two literature-based hypotheses of the therapeutic effects of fish oil for patients with Raynaud’s disease and the role of magnesium deficiency in migraine were corroborated clinically.5 For drug discovery, this model is similar to Vos’ model,6 in which discovery is the rapprochement of a drug profile (AB) to a disease profile (BC).
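The ABC inference can be illustrated with a minimal sketch. It is not the system described in this report: the function name `abc_candidates` and the toy pair lists are assumptions made for illustration, with concept names borrowed from the thalidomide example discussed in this paper.

```python
from collections import defaultdict

def abc_candidates(ab_pairs, bc_pairs, known_ac=frozenset()):
    """Return {(A, C): set of linking Bs} for every C reachable from A
    through at least one shared B, skipping A-C pairs already known."""
    b_to_c = defaultdict(set)
    for b, c in bc_pairs:
        b_to_c[b].add(c)

    candidates = defaultdict(set)
    for a, b in ab_pairs:
        for c in b_to_c.get(b, ()):
            if (a, c) not in known_ac:
                candidates[(a, c)].add(b)
    return dict(candidates)

# Toy AB and BC relations (illustrative, not extracted from MEDLINE).
ab = [("Thalidomide", "Interleukin-12"),
      ("Thalidomide", "Tumor necrosis factor")]
bc = [("Interleukin-12", "Chronic hepatitis C"),
      ("Tumor necrosis factor", "Chronic hepatitis C"),
      ("Interleukin-12", "Myasthenia gravis")]

hypotheses = abc_candidates(ab, bc)
```

Each key of `hypotheses` is a candidate A-C relation, and its value lists the intermediate B concepts that support it.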
Following Swanson’s ideas of connecting disconnected sets of literature, we developed a literature-based scientific discovery support tool that applies advanced Natural Language Processing (NLP) techniques to MEDLINE citations.7,8 This tool provides the means to efficiently analyze huge amounts of textual data and to generate new hypotheses in the biomedical domain. The hypotheses can then be evaluated bibliographically by studying computer-selected MEDLINE titles and abstracts.
In this case report we have generated new knowledge in the domain of drug discovery by finding new hypothetical therapeutic applications of the drug thalidomide. Between 1959 and 1961, thalidomide (α-N-phthalimido-glutarimide) was a popular over-the-counter sedative. Devastating teratogenic effects led to its withdrawal from the market. In recent years, however, interest in thalidomide has intensified based on its reported immunomodulatory and anti-inflammatory properties.9–12 In 1998, the Food and Drug Administration (FDA) approved thalidomide for the indication of erythema nodosum leprosum, an inflammatory manifestation of leprosy.13,14 Additionally, thalidomide seems to have beneficial effects on ulcers and wasting associated with HIV infection.11,13,14 The observation that new knowledge can be generated on a relatively old drug and that research is still ongoing led to our hypothesis that there may be new, not yet discovered applications for thalidomide.
In earlier publications, we reported on the development of a concept-based discovery support system.7 Using this system, we were able to simulate, or rediscover, two of Swanson’s discoveries (i.e., the association between fish oil and Raynaud’s disease and the association between magnesium deficiency and migraine).8 These papers provide more details of the system. The most salient feature of the system is that it is “concept-based.” This means that it applies NLP techniques to identify biomedical concepts15 in PubMed titles and abstracts at <http://www.ncbi.nlm.nih.gov/PubMed/>. The concepts originate from the Unified Medical Language System (UMLS) Metathesaurus,16 the largest biomedical thesaurus to date. The use of concepts instead of text words has several advantages. First, different expressions such as variants or synonyms collapse to one concept. For instance, IL-12, IL12, interleukin 12, CLMF, cytotoxic lymphocyte maturation factor, and natural killer cell stimulatory factor all refer to the same concept: Interleukin-12. Second, finding meaningful strings of multiple words (phrases, compounds) is nontrivial in NLP. By using concepts from the Metathesaurus, only biomedically relevant (multiword) strings are used, whereas the rest are discarded as noise. The most important reason to use UMLS concepts, however, is the semantic information that is added to them by human experts. All concepts have been assigned to one or more semantic categories or types. There is a total of 134 categories, including “Disease or Syndrome,” “Gene or Genome,” and “Amino Acid, Peptide, or Protein.” The concept Thalidomide, for instance, has been assigned the semantic types “Organic Chemical,” “Pharmacologic Substance,” and “Hazardous or Poisonous Substance.” In our discovery procedure we use this semantic information as a filter to reduce the size of the search space.
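The two ideas in this paragraph, collapsing synonyms to one concept and filtering concepts by semantic type, can be sketched as follows. The lookup tables are hypothetical miniatures written for illustration. They are not the UMLS Metathesaurus or its API, and the semantic type listed for Interleukin-12 is simplified to the one type used later in this report.

```python
# Hypothetical miniature synonym table; the real Metathesaurus is far larger.
SYNONYM_TO_CONCEPT = {
    "il-12": "Interleukin-12",
    "il12": "Interleukin-12",
    "interleukin 12": "Interleukin-12",
    "clmf": "Interleukin-12",
    "cytotoxic lymphocyte maturation factor": "Interleukin-12",
    "natural killer cell stimulatory factor": "Interleukin-12",
    "thalidomide": "Thalidomide",
}

# Semantic types per concept (Interleukin-12 entry simplified, an assumption).
SEMANTIC_TYPES = {
    "Interleukin-12": {"Immunologic Factor"},
    "Thalidomide": {"Organic Chemical", "Pharmacologic Substance",
                    "Hazardous or Poisonous Substance"},
}

def to_concept(term):
    """Map a text string to its concept, or None if it is not in the table."""
    return SYNONYM_TO_CONCEPT.get(term.strip().lower())

def filter_by_type(concepts, semantic_type):
    """Keep only the concepts that carry the requested semantic type."""
    return {c for c in concepts
            if semantic_type in SEMANTIC_TYPES.get(c, set())}

terms = ["IL-12", "CLMF", "interleukin 12", "Thalidomide", "the"]
concepts = {to_concept(t) for t in terms} - {None}   # unmapped strings are noise
immunologic = filter_by_type(concepts, "Immunologic Factor")
```

Three different surface strings collapse to the single concept Interleukin-12, the unmapped word is dropped, and the semantic filter then keeps only the immunologic concept.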
We have divided the discovery process into two steps: the generation of a hypothesis and its subsequent evaluation. In the hypothesis generation phase, only one pathway (represented by B in the ABC model) is pursued because of the potentially huge search space. Once a hypothesis that A relates to C has been generated, a search in a more restricted space may uncover additional pathways (B) between A and C and thus strengthen (or reject) the hypothesis.
Two authors of this paper (MW and GM), an information scientist and a pharmacologist/immunologist, respectively, performed the actual discoveries and formulated the proposed hypotheses. MW was the main developer of the system and used it to retrospectively (re)discover Swanson’s most famous discoveries.8 GM was involved as the prototypical user of a discovery support system. As a researcher with a background in pharmacology and immunology, she acknowledged the overflow of information in current biomedical research practices and was interested in putting the system to a real-life test. GM’s pharmacological domain knowledge of thalidomide was of a general kind. She knew the recently discovered effects of thalidomide on TNFα mRNA degradation and its role in suppressing pro-inflammatory processes. GM’s immunology expertise is broad, but she has no specific background in any of the mentioned diseases.
The search processes and the interaction with the discovery support system were mostly carried out jointly by MW and GM, in several one-hour sessions over a two-week period. Several search parameters and cut-off settings were chosen pragmatically: the aim was to reduce the list of possibilities to a manageable number (not too long) without narrowing the focus too much (not too short).
We started with thalidomide as A in the ABC model (Fig. 1). The discovery tool downloaded PubMed titles and abstracts mentioning thalidomide (and its variants and synonyms) on July 27, 2000 and mapped the natural language texts to UMLS concepts. Subsequently, we applied our semantic filter. We selected only UMLS concepts classified as “Immunologic Factor” from sentences mentioning Thalidomide because we hypothesized that we may find new therapeutic applications through the immunologic actions of the drug. The system provided a rank-ordered list of immunologic concepts as output. The discovery tool has been developed in such a way that the user also can view each of the immunologic concepts in the original context. Based on background knowledge and the provided bibliographic information, we selected promising concepts that are the Bs in the ABC model (see Fig. 1). “Promising” was defined on two grounds. First, the concepts were ordered by frequency. The higher the frequency, the more likely it is that this factor is related to thalidomide. Also, it provides us a substantial amount of possible textual “evidence” of this putative relation. Second, the expert used her knowledge to look for promising novel pathways.
The discovery tool downloaded the PubMed citations on the most promising Bs and identified the available concepts. Again, a semantic filter was applied. This time, we were interested in diseases: only concepts classified as “Disease or Syndrome” were presented, representing the C concepts in the ABC model (see Fig. 1). The system presented the sentences in which the A and B concepts co-occurred side by side with the sentences in which the B and C concepts co-occurred, to help the user infer that thalidomide (A) may be effective in disease C through immunologic pathway B.
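The two filtering steps described above (A to B, then B to C) share one pattern: count the concepts of a chosen semantic type that co-occur in a sentence with an anchor concept, and rank them by frequency. The sketch below illustrates that pattern only. The function `rank_cooccurring`, the toy sentences, and the type table are assumptions, and each sentence is taken to be already mapped to a set of concepts.

```python
from collections import Counter

def rank_cooccurring(sentences, anchor, semantic_types, wanted_type):
    """Rank concepts of `wanted_type` by how many sentences they share
    with `anchor`. Ties are broken alphabetically for reproducibility."""
    counts = Counter()
    for concepts in sentences:
        if anchor in concepts:
            for concept in concepts:
                if concept != anchor and wanted_type in semantic_types.get(concept, ()):
                    counts[concept] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy data: each sentence is a set of concepts (illustrative only).
types = {
    "Tumor necrosis factor": {"Immunologic Factor"},
    "Interleukin-12": {"Immunologic Factor"},
    "Erythema nodosum leprosum": {"Disease or Syndrome"},
    "Chronic hepatitis C": {"Disease or Syndrome"},
}
sentences = [
    {"Thalidomide", "Tumor necrosis factor", "Erythema nodosum leprosum"},
    {"Thalidomide", "Tumor necrosis factor"},
    {"Thalidomide", "Interleukin-12"},
    {"Interleukin-12", "Chronic hepatitis C"},
]

# Step 1: immunologic B concepts that co-occur with A (Thalidomide).
b_ranked = rank_cooccurring(sentences, "Thalidomide", types, "Immunologic Factor")
# Step 2: disease C concepts that co-occur with a selected B.
c_ranked = rank_cooccurring(sentences, "Interleukin-12", types, "Disease or Syndrome")
```

In the procedure reported here the choice of which B to pursue is made by the domain expert, so the ranked list is an aid to selection and not an automatic decision.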
As the list of diseases was very long, we adopted the following strategy to reduce the number of hypotheses. First, we selected diseases of which the literature had more than two immunologic B concepts in common with the thalidomide literature. Lack of co-occurring immunologic concepts represented a lack of useful immunologic pathways and therefore was not considered interesting. Subsequently, we executed automated PubMed queries on thalidomide combined (“AND” in PubMed) with every disease C. Of note, our tool searched only in titles and abstracts, thus ignoring Medical Subject Headings (MeSH), and AB and BC relations were identified as existing only if the concepts co-occurred in one sentence. This means that our search was more restricted (and more precise) than a normal PubMed search because by default PubMed searches in all available fields. If the number of hits on a default PubMed query was larger than three, we discarded the finding based on the assumption that the relation between A and C was known and hence did not represent a new discovery. These cut-off values were mainly based on our aim to reduce human workload. With the current thresholds, about one hundred diseases had to be assessed by the expert. Using the support system’s contextual presentation of the bibliographic “evidence,” this was a manageable task. When other cut-off values are used, the list will either contain too many diseases or not represent novel possible therapeutic applications.
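The two cut-offs in this paragraph can be written down as a simple filter: keep a disease only if it shares more than two immunologic B concepts with the thalidomide literature and if the default PubMed query returns no more than three hits. This is a sketch under stated assumptions. The function name, the disease labels, and all counts are hypothetical, and the PubMed hit counts are supplied as plain numbers instead of being fetched.

```python
def select_novel_diseases(shared_b, pubmed_hits, min_shared=3, max_hits=3):
    """Return diseases with at least `min_shared` linking B concepts and
    at most `max_hits` PubMed hits for the query 'thalidomide AND disease'.

    shared_b:    {disease: set of B concepts shared with thalidomide}
    pubmed_hits: {disease: number of hits of the default PubMed query}
    """
    return sorted(
        disease for disease, b_concepts in shared_b.items()
        if len(b_concepts) >= min_shared
        and pubmed_hits.get(disease, 0) <= max_hits
    )

# Hypothetical diseases and counts, for illustration only.
shared_b = {
    "Disease X": {"B1", "B2", "B3"},   # enough pathways, few hits: keep
    "Disease Y": {"B1"},               # too few shared B concepts: drop
    "Disease Z": {"B1", "B2", "B3"},   # relation probably known: drop
}
pubmed_hits = {"Disease X": 1, "Disease Y": 0, "Disease Z": 9}

remaining = select_novel_diseases(shared_b, pubmed_hits)
```

Exposing the thresholds as parameters mirrors the point made in the text: they were chosen to keep the expert's workload manageable and can be tuned.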
For the remainder of the diseases, we studied the output of the discovery system on the putative pathway between thalidomide and disease C, using the extracted sentence, the complete PubMed citation, and the full text paper, if available. At this stage, we had generated a list of diseases for which we had found some bibliographic indications that thalidomide might be an effective pharmacologic agent.
In the evaluation process we tried to find additional bibliographic and other evidence for the putative pathways between thalidomide and the listed diseases. The first step was to download and analyze citations on each of the diseases (C; see Fig. 1) found in the hypothesis generation phase. From these citations the discovery system selected immunologic factors (B) that co-occurred in sentences with this disease. The B concepts also had to co-occur in the thalidomide literature in sentences that contained the concept Thalidomide. The system listed these concepts and provided the juxtaposed AB and BC sentences for human expert assessment. This process strengthened some hypotheses while rejecting others.
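The evaluation step can be sketched as an intersection: collect the immunologic factors that share a sentence with the disease, keep those that also share a sentence with Thalidomide, and return both groups of sentences side by side for each shared factor. The function `juxtapose` and the data layout are assumptions, and the sentence texts are placeholders, not quotations from MEDLINE.

```python
def juxtapose(a_sentences, c_sentences, a, c, immunologic):
    """Return {B: (AB sentence texts, BC sentence texts)} for every
    immunologic concept B that co-occurs with both A and C."""
    def index_by_b(sentences, anchor):
        index = {}
        for text, concepts in sentences:
            if anchor in concepts:
                for b in concepts & immunologic:
                    index.setdefault(b, []).append(text)
        return index

    ab = index_by_b(a_sentences, a)
    bc = index_by_b(c_sentences, c)
    return {b: (ab[b], bc[b]) for b in sorted(ab.keys() & bc.keys())}

# Placeholder sentences: (text, set of mapped concepts).
immunologic = {"Interleukin-12", "Interleukin-10", "Tumor necrosis factor"}
thalidomide_sents = [
    ("AB sentence 1", {"Thalidomide", "Interleukin-12"}),
    ("AB sentence 2", {"Thalidomide", "Tumor necrosis factor"}),
]
disease_sents = [
    ("BC sentence 1", {"Myasthenia gravis", "Interleukin-12"}),
    ("BC sentence 2", {"Myasthenia gravis", "Interleukin-10"}),
]

evidence = juxtapose(thalidomide_sents, disease_sents,
                     "Thalidomide", "Myasthenia gravis", immunologic)
```

Only Interleukin-12 appears on both sides in this toy input, so it is the only pathway returned for expert assessment.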
In addition to PubMed, we also queried other databases available at Groningen University: Biological Abstracts (from 1990 to 2000/06), CINAHL—Nursing & Allied Health (from 1982 to 2000/06), EMBASE—Excerpta Medica (from 1989 to 2000/06), and Current Contents (from 1997 to 2000/07/28). Additionally, we queried the Internet through AltaVista at <http://www.altavista.com> and Google at <http://www.google.com>.
The discovery system downloaded 1,366 citations that included (a variant of) the concept Thalidomide in either title or abstract. These variants were “thalidomide,” “sedoval,” “synovir,” and “kevadon.” In the sentences in which Thalidomide occurred, 3,860 different concepts co-occurred. Only 82 of them were assigned the semantic type “Immunologic factor,” a reduction of 98%. Table 1 shows the 31 concepts that had a frequency of occurrence higher than two. A lower frequency indicated that the concepts were hardly mentioned in association with thalidomide and were not investigated further. Concepts such as Antigens, Antibody, or Chemotactic factors are too general to be useful. “General” corresponds loosely with the position of the concept in the thesaurus hierarchy. The higher a concept is located in the hierarchical tree, the more general it is. Because we are interested in specific and testable hypotheses, more precise concepts are preferred.
The high frequency of the concept Tumor necrosis factor (TNFα) illustrated thalidomide’s well-known characteristic of inhibition of TNFα production via increased TNFα mRNA degradation. The search also provided the concepts Interleukin-12 and Interleukin-10 (IL-10) (Table 2). The domain expert almost instantly selected these concepts for further exploration. Because of her immunologic knowledge about the role of different cytokines in T-helper cell differentiation, she appreciated thalidomide’s pharmacological potential in this immunologic process.
Thalidomide strongly inhibits IL-12 production by mononuclear cells9,17 and stimulates IL-10 production.10 IL-12 is a pleiotropic cytokine that favors the differentiation of T-helper 0 (Th0) cells into T-helper 1 (Th1) cells and hence the cellular response against cells expressing auto-antigens.18 The subsequent induced production of IFNγ by T-cells and natural killer (NK) cells, furthermore, enhances general immune cell activation, growth, and differentiation.19 Based on this bibliographic information, schematically depicted in Figure 2, we selected the concept Interleukin-12 as the immunologic pathway that may result in discovering new applications for thalidomide.
The discovery system downloaded and analyzed the 3,846 MEDLINE citations in which the B concept Interleukin-12 occurred 16,262 times. In the sentences with IL-12, 5,707 different other concepts occurred, of which 420 had the semantic type “Disease or Syndrome,” a 93% reduction. During the filtering process, as described in the Methods section, we found that for some cases the information indicated that the pharmacologic pathway would exacerbate instead of alleviate the disease. We also found NLP errors in the mapping process. The number “pi,” for instance, was incorrectly mapped to the disease Pulmonary valve insufficiency. Additionally, some disease concepts were considered too general (Critical Illness), whereas others were too specific. For instance, the retrieved concept Salmonella infections did not have any PubMed citations in common with thalidomide. However, the PubMed query “salmonella AND thalidomide” returned nine citations, indicating that thalidomide has received interest in a salmonella context.
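The reduction percentages reported for the two semantic filters follow directly from the concept counts given in the Results. The short check below recomputes them. The helper name `reduction_pct` is introduced here for illustration.

```python
def reduction_pct(total, kept):
    """Percentage of concepts removed by a filter, rounded to a whole number."""
    return round(100 * (1 - kept / total))

# Thalidomide sentences: 3,860 co-occurring concepts, 82 immunologic factors.
immunologic_reduction = reduction_pct(3860, 82)

# IL-12 sentences: 5,707 co-occurring concepts, 420 diseases or syndromes.
disease_reduction = reduction_pct(5707, 420)
```

Both values agree with the figures quoted in the text, 98% and 93%.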
Interestingly, in the filtering process we were able to strengthen the hypothesis put forward in one citation that multiple sclerosis (MS) might benefit from thalidomide. The reported putative pathway concerned thalidomide’s inhibitory effects on TNFα synthesis, thereby preventing acute exacerbations.22 In the present study, we found bibliographic evidence for the importance of IL-12 in this disease.21,22 We also found that a shift in Th1/Th2 balance might be desirable to improve disease status.23
By using the described filtering techniques we were able to obtain a list of 12 diseases that may benefit from thalidomide treatment (see Table 2). In other words, we generated 12 new hypotheses for thalidomide use.
The second column of Table 2 provides the number of PubMed citations that the discovery system analyzed for each of the 12 diseases. For some of these diseases, the bibliographic evidence was not strong because there was either not enough information or the information available was contradictory. For other cases we assumed that there was too much “circumstantial bibliographic evidence” to claim a new discovery (i.e., in diseases closely related to the identified diseases for which thalidomide had already been investigated extensively). For instance, at the time no link in the literature indicated that patients with atherosclerosis may benefit from thalidomide. However, this disease is often associated with systemic lupus erythematosus, a disease for which thalidomide treatment has been investigated. Similarly, atherosclerosis is characterized by angiogenesis (i.e., new blood vessel formation from pre-existing ones), a complex cellular phenomenon for which thalidomide’s regulatory properties are widely acknowledged. In a similar manner we judged that there was too much circumstantial evidence for thalidomide application in Sjögren’s syndrome and the closely related disease sialadenitis. Indeed, a search in the NIH clinical trial database at <http://ClinicalTrials.gov> in January 2001 showed that a phase II clinical trial was being conducted on thalidomide in patients with Sjögren’s syndrome.
Based on the discovery support system’s output (sentences with co-occurring relevant topics, PubMed abstracts, and sometimes full papers available through publishers’ websites) we compiled a list of four diseases for which we hypothesize that thalidomide may be an effective treatment (Table 3).
Chronic hepatitis C (CHC) is an inflammatory disease of the liver caused by the hepatitis C virus. The Th1/Th2 cytokine balance is important in persistence of infection and liver injury in CHC.24 Progressive liver injury in CHC is associated with upregulation of intrahepatic Th1 cytokines (IFNγ, IL-2),25 downregulation of Th2 cytokines,26 and the induction of IL-12.27 Besides a pronounced production of the Th1-associated cytokines, circulating levels of the pro-inflammatory cytokine TNFα also correlated with the degree of inflammation in the liver.28,29
Myasthenia gravis (MG) is an organ-specific autoimmune disease, putatively aimed at the nicotinic acetylcholine receptor (AChR). MG afflicts the neuromuscular junctions.30 The pathophysiologic background of MG is complex and at present mostly unknown. Aberrant production of, among others, TNFα, IL-10, and IL-12 after antigen challenge is considered to play a role in the pathology.31,32 In addition, a dysbalance in Th1/Th2 cell activation has been implicated in the disease.33
Gastritis is an inflammation of the lining of the stomach that can be caused, for example, by prolonged irritation from the use of nonsteroidal anti-inflammatory drugs or by infection with the bacterium H. pylori. Most MEDLINE abstracts concerned H. pylori-induced gastritis. In a mouse model, H. pylori-induced mucosal inflammation was shown to be Th1-mediated, with disease exacerbation in IL-4, but not IFNγ, gene-deficient mice.34 A failure to promote Th2 relative to Th1 responses would impede resolution of the infection and promote chronic inflammation. Furthermore, treatment of mice with IL-12 or anti-IFNγ antibodies, respectively, decreased the severity of gastritis, while adoptive transfer of Th1 cell lines significantly exacerbated the inflammation.35
In acute pancreatitis (AP), serum levels of IL-1, IL-6, and especially TNFα have been demonstrated to be elevated.36,37 Additionally, administration of the Th2 type cytokine IL-10 exerted a protective effect in experimental AP.38,39 The role of IL-12 in AP has not been investigated intensively, but the few citations available suggest that increased amounts of IL-12p40 in AP patients may be responsible for their increased susceptibility to infection.40 This finding suggests a differentiation toward a Th1 type of immune reaction in patients with AP. PubMed provided one reference that co-cited thalidomide and pancreatitis.41 In this case, however, pancreatitis was supposed to be a previously unreported side effect of thalidomide use.
EMBASE provided one citation with regard to chronic hepatitis C and thalidomide.42 In this poster presentation, it was observed that in patients with multiple myeloma who were treated with thalidomide, two responders had a concomitant improvement in CHC. The authors did not provide any explanation of this observation. EMBASE also included one citation on thalidomide treatment of myasthenia gravis, a Brazilian conference paper in Portuguese.43 A web search resulted in one interesting page that described new potential applications of thalidomide: “We also are looking at thalidomide in Alzheimer’s disease and myasthenia gravis.”44 EMBASE, furthermore, provided several new references with regard to pancreatitis and thalidomide. Interestingly, most of these references were also indexed in PubMed. It turned out that the difference in index terms caused the positive identification by EMBASE. These four citations seemed to suggest that pancreatitis is a side effect of thalidomide; however, closer inspection revealed that pancreatitis was put forward as a side effect caused by corticosteroids rather than by thalidomide.
In summary, only a little information is available about a relation between thalidomide and chronic hepatitis C and myasthenia gravis. We have found strong bibliographic evidence that suggests a mechanistic (immunologic) point of view on thalidomide and chronic hepatitis C and myasthenia gravis that is novel and has not been reported yet. Furthermore, we have not found any direct bibliographic indication that H. pylori-induced gastritis and pancreatitis may benefit from thalidomide. Therefore, these diseases represent truly novel potential therapeutic targets for thalidomide. All of these diseases have been related to a dysbalance in Th1/Th2 immunologic responses favoring Th1 reactions. However, we also found some references on the opposite situation, i.e., the production of excessive Th2 cytokines in the disease of interest. For example, a small number of studies indicated that the Th1/Th2 balance in CHC tends to favor Th2 responses.45,46 Whether the observed elevated Th2 cytokine levels represent a systemic response or are a result of increased local production within the liver46 remains unknown at present. Furthermore, it should be taken into account that disease initiation and subsequent progression may be differentially controlled by Th1 and/or Th2 specific cytokines and that patients can become prone to other infections when a Th1 response is therapeutically eradicated. Only experimental data about the effects of thalidomide will provide the final evidence whether the diseases identified using our discovery support tool truly benefit from the drug.
We have reported on the application of a computer discovery support tool to identify complementary but disjoint structures within the literature of medical life sciences regarding disease profiles that may benefit from thalidomide. We argued that thalidomide shifts the Th1/Th2 balance toward Th2 in human immune response. In theory, autoimmune diseases that are characterized by a Th1 differentiation may benefit from thalidomide use. Because of the complexity of the processes in the immune system, it was difficult to grasp the molecular changes induced by thalidomide when applied to the large variety of diseases known to date. However, the literature-based discovery support system successfully facilitated the formation of hypotheses regarding some of thalidomide’s mechanistic pathways. The current investigations provided only bibliographic evidence for positive pharmacologic effects of thalidomide in the diseases identified. Experimental and clinical evaluation of therapeutic benefits versus toxicities should shed light on their potential use for the treatment of various pro-inflammatory diseases that have not been studied in this context yet.
As we executed our discovery task at the end of July 2000, we now have two more years of medical research and publications to evaluate our proposed hypotheses. The application of thalidomide as a potential treatment for chronic hepatitis C has been a subject of recent discussion.47 Additionally, it is interesting to observe that one of the reasons for using IFNα for treating hepatitis C is the inhibition of IFNγ, which is also one of the pharmacologic effects of thalidomide. To complicate matters, however, clinical trials are also being conducted to use IFNγ for IFNα nonresponders (see <http://ClinicalTrials.gov>). This is a further illustration of the human immune system’s complexity and also suggests careful pre-clinical and clinical testing to truly evaluate the generated hypotheses.
New knowledge about existing drugs is generated continuously. Famous examples of new indications for existing drugs include acetylsalicylic acid (aspirin) as prophylaxis for myocardial infarction and colorectal cancer,48 minoxidil for male pattern baldness,49 and sildenafil (Viagra) for erectile dysfunction.50 The discovery of these new applications originated from clinical observation, theoretical reasoning, and serendipity. With our discovery support system, we have extended the process of drug discovery by generating hypotheses in a systematic manner by combining existing knowledge. This aspect of knowledge re-use may result in a more rapid identification of other potential applications for drugs as well as identification of patterns of side effects, for example.
Working with our tool is more complicated than executing a general PubMed search. It is an intellectually intensive process in which a domain expert has to continually evaluate hypotheses to reduce the long initial list to strong, testable ones. An example of the indispensability of expert knowledge concerns the discovery of myasthenia gravis as target disease for thalidomide. During the analyses, one PubMed citation was found that co-mentioned thalidomide and experimental myasthenia gravis.51 This study, conducted in Lewis rats, did not show any pharmacologic effect of thalidomide. A likely reason for this is the fact that Lewis rats respond in a typical Th1 fashion after immunologic challenge.52 In a system in which the Th1 response prevails, pharmacologic interference at this level may not be sufficient to obtain an effect. The expert knowledge enabled us not to identify this citation as a refutation of our hypothesis.
This example of the human expert’s indispensability is typical of our view of literature-based discovery. Computational tools assist human domain experts in studying the literature and deriving and evaluating novel hypotheses. Using tools such as our system may lead to more focused laboratory experiments and may make new and fruitful connections between scientific disciplines. The scientific environment of biomedical research is characterized by a significant increase of data and information from highly automated experiments and online databases. Intelligent and computer-assisted approaches to sift through this information will become necessary tools on the researcher’s bench in the near future.
The authors are grateful to James G. Mork from the National Library of Medicine for providing access to the NLP tools.