Developers commonly try to replicate Swanson's early findings as a means of system appraisal because (a) much of Swanson's work has been validated independently and empirically by clinical researchers and (b) no other agreed-upon criteria exist, with the exception of expert opinion regarding relevancy of results and feasibility of hypotheses. In this context, appraisal implies evaluation of the goodness of sets of discovered hypothetical relationships. If no other criteria for demonstrating validity exist, evaluation must await tests by empiricists who happen to find the results interesting [9]. This is a major problem for developers of hypothesis generating systems.
However, a variant of this approach is possible. Developers could work retrospectively on other well-known, empirically validated phenomena by mining the relevant literature up to meaningful cutoff dates. The goodness of the result sets would depend on whether known causal or temporal relationships are recovered. This is similar to using Swanson's early findings as evaluation criteria, but it opens the discovery process to other domains in basic and applied research, such as molecular biology, chemistry, physical therapy, nursing, or public health.
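The retrospective evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a simple co-occurrence, ABC-style open discovery step (one of many possible mining methods, not a method prescribed here), and the names `Record`, `cooccurring`, `open_discovery`, and `recovered_fraction`, as well as the toy corpus, are hypothetical. The idea is to restrict the literature to records published up to a cutoff year, generate candidate relationships, and then check how many relationships validated after the cutoff appear among the candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A bibliographic record reduced to a publication year and its index terms."""
    year: int
    terms: frozenset


def cooccurring(records, term):
    """Return every term that shares at least one record with `term`."""
    found = set()
    for rec in records:
        if term in rec.terms:
            found |= rec.terms
    found.discard(term)
    return found


def open_discovery(records, start_term, cutoff_year):
    """Return candidate terms linked to start_term only through intermediate
    terms, using records published up to and including cutoff_year."""
    visible = [r for r in records if r.year <= cutoff_year]
    b_terms = cooccurring(visible, start_term)
    candidates = set()
    for b in b_terms:
        candidates |= cooccurring(visible, b)
    # Drop the start term itself and anything already directly linked to it.
    return candidates - b_terms - {start_term}


def recovered_fraction(candidates, validated):
    """Share of later-validated relationships present among the candidates."""
    if not validated:
        return 0.0
    return len(candidates & validated) / len(validated)


# Invented toy corpus: terms and years are placeholders, not real literature.
corpus = [
    Record(1980, frozenset({"disease_x", "marker_b"})),
    Record(1982, frozenset({"marker_b", "agent_c"})),
    Record(1984, frozenset({"marker_b", "agent_d"})),
    Record(1990, frozenset({"disease_x", "agent_c"})),  # published after the cutoff
]

candidates = open_discovery(corpus, "disease_x", cutoff_year=1985)
score = recovered_fraction(candidates, validated={"agent_c"})
```

In this toy case the pre-cutoff literature links `disease_x` to `agent_c` only indirectly, so the relationship reported in the later record is recovered as a candidate. A real evaluation would also need to account for the unvalidated candidates (here `agent_d`), which this sketch does not attempt to score.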
Regardless of disciplinary focus, it is probable that researchers will want to retrieve and merge information from several kinds of databases. This assumes continued interest in interdisciplinary research and expansion of overlapping databases. For example, to glimpse the interconnectedness of databases already available for molecular biologists, visit the National Center for Biotechnology Information website [38] and select one of the nodes of the graphic for Entrez, the integrated system for search and retrieval. This leads to a display of the number of links between databases. These are not symmetric; for example, the number of links between PubMed and Cancer Chromosomes depends on whether one selects PubMed (8,016 links) or Cancer Chromosomes (50,051). This asymmetry will have an impact on future merging and mining efforts.
Currently, hypothesis discovery systems are still in the early phase of development, at least from the perspective of potential users. Nevertheless, in addition to probing appropriate methods for extraction and analysis, it would behoove developers to participate with research teams working on substantive rather than methodological problems. Otherwise, the mainstream biomedical community will ignore results of SL studies, no matter how fascinating.
Additionally, the phrases 'hypothesis testing' and 'knowledge discovery' in the context of text mining are not credible to experimentalists trained in the positivist tradition. Since the appropriate use of text-based discovery methods is exploratory, and therefore useful in early phases of research programs or in proof-of-concept studies, a more general phrase, such as 'exploratory mining', might be more acceptable.
Once the role of discovery methods in research programs is clarified, partnerships with the biomedical community will develop apace. Certainly, the timing is auspicious given greater acceptance of conceptual and computational biology, as well as rapid development of text mining tools. As an example of growing awareness of the potential for discovery methods, consider the following comment by the Director of the Office of Scientific Interchange at the National Institute of Arthritis and Musculoskeletal and Skin Diseases:
The comprehensive overview of an entire literature with respect to a single question is now in transition. New tools in informatics are making it possible to fuel the search for biomarkers for SLE [systemic lupus erythematosus] ... rapidly and with nuance. Rather than looking for articles using the same key words, or for bibliographic citations in a work of interest, the entire database of medical literature can be probed.... (pp. 223–224) [39]
In support of her suggestion for an informatics-driven review of literature, Mittleman cites several Swanson papers and therefore is aware of the origins of text mining for discovery. It seems clear that Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians. Innovative librarians and information professionals could respond to the changing information needs of their patrons by monitoring developments in KDT, and by acquiring the necessary skills to help patrons locate and mine appropriate databases. Major health sciences libraries could build computational biology centers modeled after Princeton University's Data and Statistical Services (DSS) in the Harvey S. Firestone Memorial Library. Although the DSS unit is not dedicated to biology, the idea of offering consulting services to a particular community is apropos. Even without a dedicated center, one or more librarians could be trained in KDT methods to help biomedical researchers and conceptual biologists locate information useful for generating and testing hypotheses.