The number of textbooks related to patient care and basic science is already large and continues to grow. In areas of health care with rapidly evolving or intricate management strategies, textbooks constitute a critical resource for health providers. In 1991, Forsythe and colleagues observed clinicians on rounds and in clinics, recording 454 clinical questions that arose and the ultimate sources of answers to these queries.[1] The clinicians obtained answers to many questions by consulting multiple information sources, most often the patient’s medical chart or records or the patient directly. Answering a quarter of the questions, however, required consulting medical library materials. Of these, the answers to more than half were found in a textbook and the remainder in medical journals. In another study, internists’ self-reported use of various information sources roughly matched these figures.[2] During outpatient treatment of AIDS patients, five textbooks, in conjunction with a system for searching medical journals, provided answers to approximately 75% of clinicians’ questions.[3]
However, the use of textbooks is frequently neither straightforward nor expedient. Searching a textbook consumes on average six minutes of time that could be used for other clinical care tasks.[4] Faced with an urgent information need, clinicians often must rely on manual inspection of a table of contents or alphabetized keyword index to guide their search. Although formal evaluations of such indexes are lacking, in one study only 30% of internists felt that textbooks are “adequately indexed for rapid information retrieval.”[2] Tables of contents most commonly list only one or two levels of section headings, each indexed by the page number on which the section begins. Alphabetized keyword indexes, like those commonly found at the end of most textbooks, may cross-reference two or more words that occur in the textbook and therefore are usually more precise. However, like tables of contents, most of these indexes also direct the reader only to whole pages in the textbook; readers are still left with the time-consuming task of finding specific answers to their questions in the text of those pages.
Many textbooks have recently become available in electronic format (e.g., on CD-ROM). Some of these textbooks can now be searched by locating terms that occur in the text, either alone or in some predefined proximity to each other. However, this type of indexing alone is unlikely to improve textbook search precision. Hersh found the precision of information retrieval from one medical textbook using such term search methods to be an abysmal 19%.[5] In addition, the organization of some electronic medical textbooks is such that low cognitive load tasks (e.g., visual scanning) cannot be performed as easily as with printed versions, even with key term highlighting.[6]
Indexes that would allow clinicians, researchers, and patients to retrieve the information they need from these sources rapidly and with greater precision must contain more knowledge than merely the location of the beginning of textbook sections or the numbers of pages on which one or two concepts are discussed. Entries in these indexes must mirror the questions that drive readers to use the textbook to seek knowledge. Furthermore, these indexes must point the reader to more specific locations in the text. For example, consider a resident physician with the specific question “What is the appropriate duration of therapy for the treatment of a patient with Pseudomonas pneumonia using aminoglycosides?” A traditional alphabetized keyword index might contain an entry for “pneumonia” and “therapy” that points to several different pages in a textbook, only a minority of which contain discussions of the length of treatment of pneumonia caused by Pseudomonas species. A superior index would allow residents first to find their exact question in the index and then to find the specific portions of a textbook that contain answers.
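The contrast between the two kinds of index in the Pseudomonas example can be sketched as a pair of lookup tables. This is only an illustrative sketch, not the data structure of any system described here: the names `Location`, `KEYWORD_INDEX`, `QUERY_INDEX`, `DURATION_QUERY`, and `lookup_query`, the page and paragraph numbers, and the wording of the question template are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Location:
    """A pointer into the textbook; paragraph=None means 'whole page only'."""
    page: int
    paragraph: Optional[int] = None


# Traditional keyword index: coordinated terms point to whole pages.
# (Hypothetical page numbers.)
KEYWORD_INDEX: Dict[Tuple[str, ...], List[Location]] = {
    ("pneumonia", "therapy"): [
        Location(212), Location(215), Location(230), Location(241),
    ],
}

# Hypothetical question template whose slots are filled by specific concepts.
DURATION_QUERY = (
    "What is the appropriate duration of therapy for {disease} "
    "caused by {organism} using {drug}?"
)

# Query-based index: a template plus its slot fillers points to a specific
# paragraph. Slot fillers are stored sorted by slot name so keys are canonical.
QUERY_INDEX: Dict[Tuple[str, Tuple[Tuple[str, str], ...]], List[Location]] = {
    (
        DURATION_QUERY,
        (
            ("disease", "pneumonia"),
            ("drug", "aminoglycosides"),
            ("organism", "Pseudomonas"),
        ),
    ): [Location(230, paragraph=4)],
}


def lookup_query(template: str, **concepts: str) -> List[Location]:
    """Return the fine-grained locations indexed under a filled-in question."""
    key = (template, tuple(sorted(concepts.items())))
    return QUERY_INDEX.get(key, [])


hits = lookup_query(
    DURATION_QUERY,
    disease="pneumonia",
    organism="Pseudomonas",
    drug="aminoglycosides",
)
```

In this toy data the keyword entry leaves the reader with four whole pages to scan, whereas the question-keyed entry returns a single paragraph-level location.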
Both the creation and the use of such detailed and specific indexes present several challenges. First, they are potentially much larger than traditional keyword indexes, because they must include entries for large numbers of different and specific, complex questions readers have, and the specific locations of answers to these questions. Indeed, for many textbooks, such indexes may be so large that they themselves require a system for navigating to a desired question in the index to be of any practical use. Second, the amount of labor required to generate such indexes manually is likely to be very large, because more extensive coordination of indexed terms and more specific reference back into the text are required. Finally, the nature of such detailed, query-based indexes requires that indexers have significant domain-specific knowledge, particularly an understanding of the proper and specific relation between index terms and the ability to recognize index terms in the text that are implied, but not stated.
We have developed ELBook, a computer-based system for retrieving fine-grained (i.e., highly specific) information from text documents.[7] ELBook requires that domain experts make explicit as indexes some of the knowledge contained in the text of documents. However, unlike the pioneering Hepatitis Knowledge-Base,[8] ELBook does not require indexers to provide a representation of all the knowledge contained in its documents. Query and concept models constrain the space of possible ELBook indexes and queries. Thus, the design of ELBook builds on the attempts by Hersh, Purcell,[9] and others to incorporate more domain-specific knowledge into full-text IR systems (Table 1).
Table 1. Full-text Information Retrieval Systems and Selected Characteristics of Their Indexing Methods
Enthusiasm for the retrieval capabilities of ELBook has consistently been tempered, however, by the realization that generating indexes for ELBook without computer-based support is extremely time-consuming.[10] Furthermore, to evaluate the performance of ELBook on any test collection, we require a substantial amount of such precise indexing. If indexes of this type cannot be generated with sufficient accuracy and in an acceptably efficient manner, increases in search precision or recall using these indexes would be moot. Thus, we set out to develop a computer-based system that could provide automated support for indexing full-text documents for use in ELBook.
Several researchers have attempted to automate some or all of the process of generating indexes for various types of full-text documents. Investigators at the National Library of Medicine (NLM) have described both semi-automated[11] and fully automated[12] indexing systems designed for journal publications. MedLEE, a natural language understanding system, can extract concepts from clinical notes and reports with reasonable accuracy, and these concepts can then be used as indexes, although modeling domain knowledge for specific applications remains a bottleneck.[13–16] Furthermore, NLP systems like MedLEE typically do not provide support for interactive document indexing; such interaction could improve the accuracy of index generation through human review.
This article evaluates ISAID (Internet-based Semi-automated Indexing of Documents), a computer-based system to generate textbook indexes that are more detailed and, hopefully, more useful to readers. This system requires domain-dependent query and concept models as well as a domain-independent document model to provide some of the knowledge required to create such complex indexes. ISAID requires that a domain expert first describe a set of questions, or generic queries, to be used as templates for indexes. Collectively, these questions constitute the query model. The concept model that ISAID uses for the medical domain is largely based on the Unified Medical Language System semantic network. The document model is based on the explicit and implicit structure of Hypertext Markup Language (HTML) documents. ISAID uses a modified vector-space model to help propose candidate indexes. We performed limited comparisons of the ability of ISAID users to generate indexes versus a manual indexing system, and then proceeded to evaluate the contributions of the document and vector-space models to the indexing process. We examined the consistency and speed of indexers using ISAID as a necessary first step towards the evaluation of ELBook.
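To make the roles of the document and vector-space models more concrete, the following minimal sketch splits an HTML fragment into paragraph-level units and ranks them against a question using ordinary tf-idf weights and cosine similarity. It is an assumption-laden illustration, not ISAID's method: the modifications ISAID makes to the vector-space model are not described in this section, and the names `ParagraphCollector`, `tokenize`, and `rank_units`, the idf smoothing, and the placeholder HTML are all hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect each <p> element together with the nearest preceding heading."""

    def __init__(self):
        super().__init__()
        self.heading = ""
        self.units = []    # list of (heading text, paragraph text)
        self._tag = None   # element currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "p"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())
        if tag == "p":
            self.units.append((self.heading, text))
        else:
            self.heading = text
        self._tag = None


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rank_units(html, query):
    """Return (score, (heading, paragraph)) pairs, best match first."""
    parser = ParagraphCollector()
    parser.feed(html)
    # Each unit is represented by its paragraph text plus its section heading.
    docs = [Counter(tokenize(h + " " + p)) for h, p in parser.units]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequencies

    def weights(counts):
        # Smoothed tf-idf (an assumed weighting scheme, chosen for simplicity).
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = weights(Counter(tokenize(query)))
    scored = [(cosine(q, weights(d)), u) for d, u in zip(docs, parser.units)]
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Placeholder document text, invented for illustration only.
SAMPLE_HTML = """
<h2>Pseudomonas pneumonia</h2>
<p>Placeholder text on choice of antibiotic agents.</p>
<p>Placeholder text on the duration of therapy with aminoglycosides.</p>
<h2>Viral hepatitis</h2>
<p>Placeholder text on supportive care.</p>
"""

ranked = rank_units(
    SAMPLE_HTML, "duration of therapy Pseudomonas pneumonia aminoglycosides"
)
```

In a semi-automated setting, a ranked list of this kind would only propose candidate locations; a human indexer would still accept or reject each one.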