How might one apply the theories previously described in developing a comprehensive "real world" information retrieval and knowledge discovery system? As reviewed in the previous section, the Telemakus system is built on and extends prior research in the areas of concept representation, schema theory and information visualization. Work on components of what has become the Telemakus system has been underway for many years with a particular emphasis on the importance and utility of relationships extracted from data tables and figures [24]. Fuller [12] identified key objective elements important in representing a clinical research report and developed a schematic representation. The clinical trials schema has been adapted for representing basic sciences research reports in the Telemakus system.
Based on how scientists use and want to use the research literature, Telemakus brings together three innovations in analyzing, displaying and summarizing research reports across a domain:
1. Research Report Schema: Research methods and findings are extracted and presented in a consistent, coherent and structured schema format which mimics the research process itself and provides a high-level research report surrogate to facilitate searching as well as rapid review of retrieved documents.
2. Research Findings extracted from data tables and figures are used to index the documents, allowing searchers to request research studies which report a relationship between two concepts of interest.
3. Visual Exploration Interface provides a dynamic map of extracted research findings to graphically display what is known as well as, through gaps in the map, what is yet to be tested.
Knowledgebase Creation & Components
The Telemakus system consists of a database, research report schema and tools to create relationship maps among concepts across documents. The research report schema serves as a surrogate for the study, methods and research findings for each document as well as providing an interactive search interface. The schematic representations include standard bibliographic information (author, title, journal), information about the research design and methods (age, sex, number of subjects, pre-treatment and treatment regimen, organism and source of organism) and, most importantly, research findings derived from data tables and figures.
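As a rough illustration of the three groups of information described above, one record in the research report schema might be modelled as follows. This is a minimal sketch only: the class names, field names and sample values are hypothetical and are not the actual Telemakus database fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResearchFinding:
    """A relationship between two concepts reported in one table or figure."""
    concept_a: str
    concept_b: str
    source_label: str  # hypothetical label format, e.g. "Table 3"


@dataclass
class ResearchReport:
    """Hypothetical surrogate record for one research report."""
    # Standard bibliographic information
    authors: List[str]
    title: str
    journal: str
    # Research design and methods
    organism: Optional[str] = None
    source_of_organism: Optional[str] = None
    age: Optional[str] = None
    sex: Optional[str] = None
    number_of_subjects: Optional[int] = None
    pre_treatment: Optional[str] = None
    treatment_regimen: Optional[str] = None
    # Research findings derived from data tables and figures
    findings: List[ResearchFinding] = field(default_factory=list)

    def reports_relationship(self, a: str, b: str) -> bool:
        """True if any finding links the two concepts, in either order."""
        wanted = frozenset((a.lower(), b.lower()))
        return any(
            frozenset((f.concept_a.lower(), f.concept_b.lower())) == wanted
            for f in self.findings
        )


# Illustrative values only; not taken from an actual Telemakus record.
report = ResearchReport(
    authors=["Author A"],
    title="Example title",
    journal="Example journal",
    findings=[ResearchFinding("energy intake", "mammary neoplasms", "Table 3")],
)
print(report.reports_relationship("mammary neoplasms", "energy intake"))
```

Treating a finding as an unordered pair of concepts means a searcher does not need to know in which order the two concepts were recorded.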
The elements extracted by the Telemakus system from full-text documents are listed in Table . There are 22 fields for each document, with 12 routinely obtained from PubMed. Of the remaining fields, entries in four are controlled by thesauri. Two fields, Authors and SourceOfOrganisms, use customized thesauri developed specifically for the Telemakus system. Two additional fields, ResearchFindings and Organism, use the Unified Medical Language System® (UMLS®) Metathesaurus® as the basis for creating a controlled vocabulary.
Research report schema database fields
The UMLS Metathesaurus is a rich database of information on concepts that appear in one or more of a number of different controlled vocabularies and classifications used in the field of biomedicine. It provides a uniform, integrated distribution format of over 95 biomedical vocabularies and classifications and contains syntactic information. All Metathesaurus concepts are assigned to specific types or categories (e.g., "Disease or Syndrome," "Virus"), and the Semantic Network contains information about the permissible relationships among these types (e.g., "Virus" causes "Disease or Syndrome") [45]. The 2004 edition of the UMLS Metathesaurus includes over 1 million biomedical concepts and 2.8 million concept names in its source vocabularies [46].
The thesauri are reviewed (curated by expert indexers) in order to create a consistent controlled vocabulary structure. As indicated in Table , research concepts and organism type thesauri are derived from the UMLS. As new concepts are identified from the document's data tables and figures, the UMLS is used to identify preferred terms that are added to the controlled vocabulary database. In addition to the preferred term, its synonyms, semantic type, broader and narrower terms and Unique Identifier are captured. The UMLS provides a very powerful approach to rapidly creating a robust scientific thesaurus in support of consistent and precise searching. Further, the semantic type descriptors for each concept and semantic network may offer some interesting opportunities for intelligent searching and mapping of research findings and their relationships in the future.
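The information captured for each concept can be pictured as a small record keyed by every name under which the concept may appear. The sketch below is an assumption-laden illustration: the `ThesaurusEntry` and `Thesaurus` classes are hypothetical, the identifier is a placeholder rather than a real UMLS Unique Identifier, and the semantic type and broader term are illustrative values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ThesaurusEntry:
    """Hypothetical controlled-vocabulary record for one concept."""
    preferred_term: str
    unique_identifier: str  # placeholder below, not a real UMLS identifier
    semantic_type: str
    synonyms: List[str] = field(default_factory=list)
    broader_terms: List[str] = field(default_factory=list)
    narrower_terms: List[str] = field(default_factory=list)


class Thesaurus:
    """Looks up an entry by its preferred term or by any synonym."""

    def __init__(self) -> None:
        self._index: Dict[str, ThesaurusEntry] = {}

    def add(self, entry: ThesaurusEntry) -> None:
        # Index the preferred term and every synonym, case-insensitively.
        for name in [entry.preferred_term, *entry.synonyms]:
            self._index[name.strip().lower()] = entry

    def lookup(self, term: str) -> Optional[ThesaurusEntry]:
        return self._index.get(term.strip().lower())

    def preferred_term(self, term: str) -> Optional[str]:
        entry = self.lookup(term)
        return entry.preferred_term if entry else None


thesaurus = Thesaurus()
thesaurus.add(ThesaurusEntry(
    preferred_term="mammary neoplasms",
    unique_identifier="PLACEHOLDER-ID",    # assumption: stand-in value
    semantic_type="Neoplastic Process",    # illustrative value
    synonyms=["mammary gland carcinomas"],
    broader_terms=["neoplasms"],           # illustrative value
))
print(thesaurus.preferred_term("Mammary gland carcinomas"))
```

A term that returns `None` has no entry yet and would be a candidate for checking against the UMLS and adding to the thesaurus.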
At present, data extraction combines manual and automated processes. An evolving approach to building and revising the thesauri is an important component of the Telemakus system, ensuring that vocabulary identification and management reflect the specialized needs of the knowledge domain as new research concepts are identified and reported.
The knowledgebase construction process begins with an Internet search of a bibliographic database (e.g., PubMed or Web of Science®). Database elements are extracted and verified against the relevant thesaurus. As new concepts are identified, the UMLS is checked for the preferred term, which is added to the appropriate Telemakus thesaurus along with its synonyms and its narrower and broader terms.
One of the key innovations in the Telemakus system is the use of data tables and figures to locate the concepts studied (and tested) by the researchers. Concentrating on the legends of data tables and figures focuses the extraction process and reduces the background noise of the full-text document, making the process tractable. In general, the information content of data tables and figures falls into two types: "facts" and "findings." Facts report the experimental design and comparative characteristics of the animals in the study group (e.g., weight, age, pre-existing conditions). Findings are the results of the study (the research findings). Research findings are extracted from each of the "findings" data tables in a process described in Figure .
Process for deriving research relationships from data tables
Table provides a list of legends (the descriptions of content) from data tables and figures from a single research report and the end results of the extraction process. The legends are categorized into information content type (Fact or Research Finding), extracted concept relationships and concept relationships normalized (preferred terms) using the UMLS tools. In Table , the first two legends report "facts" (the experimental design and the composition of the diets of the research animals) while "findings" are reported in the remaining legends. The third column displays the noun phrases extracted from the legends which are then mapped to their corresponding UMLS preferred terms, as seen in the fourth column. When mapped, the term "dietary intake" maps to "energy intake" and "mammary gland carcinomas" maps to "mammary neoplasms." This provides a "controlled vocabulary" which enhances the consistency of retrieval from the knowledgebase.
Table 2. Information content type categorization and relationship concept candidates for a sample of table/figure legends. Extracted from Zhu Z, Haegele AD, Thompson HJ: Effect of caloric restriction on pre-malignant and malignant stages of mammary carcinogenesis.
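The path from a legend to a normalized concept pair can be sketched as below. The two term mappings are those reported in the text; everything else is an assumption made for illustration, including the `LegendRow` structure, the legend wording and the pairing of the two concepts in one legend.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# Mappings reported in the text: extracted noun phrase -> preferred term.
PREFERRED_TERMS: Dict[str, str] = {
    "dietary intake": "energy intake",
    "mammary gland carcinomas": "mammary neoplasms",
}


class LegendRow(NamedTuple):
    """Hypothetical row: one legend and what was extracted from it."""
    legend: str
    content_type: str                     # "Fact" or "Research Finding"
    extracted: Optional[Tuple[str, str]]  # concept pair; None for facts


def normalize_term(term: str) -> str:
    """Return the preferred term, or the cleaned term if none is known."""
    cleaned = term.strip().lower()
    return PREFERRED_TERMS.get(cleaned, cleaned)


def normalized_findings(rows: List[LegendRow]) -> List[Tuple[str, str]]:
    """Keep only research findings and normalize both concepts."""
    pairs = []
    for row in rows:
        if row.content_type == "Research Finding" and row.extracted:
            a, b = row.extracted
            pairs.append((normalize_term(a), normalize_term(b)))
    return pairs


# Legend wording below is invented for illustration.
rows = [
    LegendRow("Composition of diets", "Fact", None),
    LegendRow("Effect of dietary intake on mammary gland carcinomas",
              "Research Finding",
              ("dietary intake", "mammary gland carcinomas")),
]
print(normalized_findings(rows))
```

Legends classed as facts are skipped, so only relationships that were actually tested reach the index.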
A current focus is the application of natural language processing (NLP) techniques to assist in automating the concept extraction process. MetaMap, a program developed by the National Library of Medicine®, is being tested as a means of automatically parsing the legends from the data tables and figures to identify preferred UMLS concepts for addition to the Telemakus thesauri. MetaMap maps arbitrary text to concepts in the UMLS Metathesaurus; or, equivalently, it discovers Metathesaurus concepts in text. With this software, text is processed through a series of modules. First it is parsed into components including sentences, paragraphs, phrases, lexical elements and tokens. Variants are generated from the resulting phrases. Candidate concepts from the UMLS Metathesaurus are retrieved and evaluated against the phrases. The best of the candidates are organized into a final mapping in such a way as to best cover the text [47].
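The general shape of such a pipeline (tokenize, generate variants, retrieve candidates, choose a mapping that covers the text) can be shown with a toy example. This sketch is not MetaMap and does not use its API or its scoring: it substitutes a greedy longest-match lookup against a tiny hypothetical vocabulary, and its variant generation is a naive plural/singular toggle.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-vocabulary: surface string -> preferred term.
VOCAB: Dict[str, str] = {
    "dietary intake": "energy intake",
    "mammary gland carcinomas": "mammary neoplasms",
    "caloric restriction": "caloric restriction",
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def variants(phrase: str) -> List[str]:
    """Crude variants: the phrase itself plus a plural/singular toggle."""
    toggled = phrase[:-1] if phrase.endswith("s") else phrase + "s"
    return [phrase, toggled]


def map_text(text: str, vocab: Dict[str, str] = VOCAB,
             max_len: int = 4) -> List[Tuple[str, str]]:
    """Greedy left-to-right mapping, preferring the longest phrase."""
    tokens = tokenize(text)
    mappings: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest window first so longer phrases win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            hit = next((v for v in variants(phrase) if v in vocab), None)
            if hit is not None:
                mappings.append((phrase, vocab[hit]))
                matched = n
                break
        i += matched if matched else 1  # skip tokens with no candidate
    return mappings


print(map_text("Effect of dietary intake on mammary gland carcinoma"))
```

A real mapper evaluates competing candidates against each phrase rather than taking the first long match, which is the part this sketch deliberately leaves out.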
Telemakus KnowledgeBase System Architecture
The Telemakus system architecture centers on: a relational database; a set of tools used to populate the knowledgebase with data extracted from bibliographic databases and full-text research reports; and several server side tools and programs responsible for delivering the content of the database to the public via the WWW. The entire system is built from open-source components, leveraging standard protocols and tools whenever possible.
The document processing system is initiated by an analyst, who runs the extractions on the document being processed and reviews and edits them as necessary. It currently consists of a number of discrete phases to download, extract and analyze each document. These services are built primarily in Java, run behind Tomcat and Apache, and are accessed by the analyst through a browser.
For the public Telemakus website interface, a number of open-source solutions have been selected and configured. An Apache web server intercepts all requests and delegates each one to a surrogate process dedicated to that task. Requests to display data from the database are delegated to Zope, a content management service, which typically runs SQL queries against a PostgreSQL database and renders the results in the conceptual schema that serves as a surrogate for each document. For tasks beyond simple queries and HTML requests, a Java Servlet™ is employed. As plain HTML is insufficient to effectively display and interact with the relationship map, a Java™ applet, TouchGraph, is used.
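The kind of SQL query involved in finding documents that report a relationship between two concepts might look like the following. This is a sketch under stated assumptions: the table and column names are hypothetical, the rows are invented, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example runs without a database server.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            first_author TEXT,
            year INTEGER
        );
        CREATE TABLE findings (
            doc_id INTEGER REFERENCES documents(id),
            concept_a TEXT,
            concept_b TEXT
        );
    """)
    # Invented rows, for illustration only.
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                     [(1, "Author A", 2000), (2, "Author B", 2001)])
    conn.executemany("INSERT INTO findings VALUES (?, ?, ?)", [
        (1, "caloric restriction", "mammary neoplasms"),
        (2, "mammary neoplasms", "caloric restriction"),
        (2, "survival rate", "antioxidants"),
    ])
    return conn


def documents_linking(conn: sqlite3.Connection, a: str,
                      b: str) -> List[Tuple[int, str, int]]:
    """Return documents reporting a finding that links a and b."""
    sql = """
        SELECT DISTINCT d.id, d.first_author, d.year
        FROM documents AS d
        JOIN findings AS f ON f.doc_id = d.id
        WHERE (f.concept_a = ? AND f.concept_b = ?)
           OR (f.concept_a = ? AND f.concept_b = ?)
        ORDER BY d.year, d.first_author
    """
    # Parameters are bound twice so the pair matches in either order.
    return conn.execute(sql, (a, b, b, a)).fetchall()


conn = build_demo_db()
print(documents_linking(conn, "mammary neoplasms", "caloric restriction"))
```

Binding the concept names as query parameters, rather than building the SQL string by hand, keeps user-supplied search terms from altering the query.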
TouchGraph is an open-source concept-mapping tool for creating and navigating links between information sources. The tool was chosen for Telemakus because of its flexibility, customizing capabilities, high quality source code and compatibility with most browsers and operating systems (OS). The TouchGraph visualization package serializes maps to and from XML. By using Java, HTTP and XML, TouchGraph makes it easy to dynamically feed content to generate interactive nodes-and-edges maps.
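Serializing a nodes-and-edges map to and from XML can be illustrated generically. The element and attribute names below are hypothetical and are not TouchGraph's actual XML format; the sketch shows only the round trip between a list of concept links and an XML document, using Python's standard library.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (concept label, concept label)


def map_to_xml(edges: List[Edge]) -> str:
    """Serialize concept links to a simple, made-up XML layout."""
    root = ET.Element("map")
    nodes_el = ET.SubElement(root, "nodes")
    edges_el = ET.SubElement(root, "edges")
    ids: Dict[str, str] = {}
    for a, b in edges:
        for label in (a, b):
            if label not in ids:
                # Assign each distinct concept one node id.
                ids[label] = f"n{len(ids)}"
                ET.SubElement(nodes_el, "node", id=ids[label], label=label)
        ET.SubElement(edges_el, "edge", source=ids[a], target=ids[b])
    return ET.tostring(root, encoding="unicode")


def xml_to_map(xml_text: str) -> List[Edge]:
    """Parse the XML produced by map_to_xml back into concept links."""
    root = ET.fromstring(xml_text)
    labels = {n.get("id"): n.get("label") for n in root.iter("node")}
    return [(labels[e.get("source")], labels[e.get("target")])
            for e in root.iter("edge")]


links = [("neoplasms", "caloric restriction"),
         ("neoplasms", "survival rate")]
xml_text = map_to_xml(links)
print(xml_text)
print(xml_to_map(xml_text) == links)
```

Because the map is plain XML, a server-side process can generate it on request and deliver it over HTTP to whatever client renders the graph.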
Database Query and Navigation: How Does it Work?
Figure shows the initial search screen, the starting point for a search of a knowledgebase. The user can search using Boolean logic on a number of fields, including the abstract, keywords, full-text, title and research findings. Each of the thesauri (Author, Research Findings, Organisms and Source of Organisms) is also available for browsing and is directly searchable. Results can be sorted by year, first author or journal title.
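A fielded Boolean search with sorting can be sketched in a few lines. The function names, the record layout and the sample records are hypothetical; the actual search screen is not described in enough detail to reproduce, so this shows only the AND / OR / NOT logic over named fields.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]
Clause = Tuple[str, str]  # (field name, search term)


def field_contains(record: Record, clause: Clause) -> bool:
    """Case-insensitive substring match of a term within one field."""
    field_name, term = clause
    return term.lower() in str(record.get(field_name, "")).lower()


def boolean_search(records: Iterable[Record],
                   must: Sequence[Clause] = (),
                   should: Sequence[Clause] = (),
                   must_not: Sequence[Clause] = (),
                   sort_by: str = "year") -> List[Record]:
    """AND all `must`, OR the `should` clauses, and exclude `must_not`."""
    hits = []
    for record in records:
        if not all(field_contains(record, c) for c in must):
            continue
        if should and not any(field_contains(record, c) for c in should):
            continue
        if any(field_contains(record, c) for c in must_not):
            continue
        hits.append(record)
    # Every record is assumed to carry the sort field.
    return sorted(hits, key=lambda r: r[sort_by])


# Invented records, for illustration only.
records = [
    {"first_author": "Author B", "year": 2001,
     "title": "Caloric restriction and neoplasms",
     "research_findings": "caloric restriction - neoplasms"},
    {"first_author": "Author A", "year": 1999,
     "title": "Antioxidants and survival",
     "research_findings": "antioxidants - survival rate"},
    {"first_author": "Author C", "year": 1998,
     "title": "Neoplasms in aging",
     "research_findings": "age - neoplasms"},
]
results = boolean_search(records, must=[("research_findings", "neoplasms")])
print([r["first_author"] for r in results])
```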
Figures and show the results of the search. Clicking on the first listing (Chung) results in the retrieval of the complete record for that item in the research report schema format (Figure ), a rapid summary of research methods and organism characteristics that provides quick links to a variety of types of information including the full-text of the research article. Clicking on any blue highlighted item under "Table/Figure" takes the searcher to the respective figure in the full-text article.
Display of retrieval set for a search on "neoplasms" (part 1)
Display of retrieval set for a search on "neoplasms" (part 2)
Research report schema. Schema for one of the retrieved scientific reports from a search on caloric restriction and neoplasms
The research report schema also serves as a convenient interface for searching for related research concepts, offering a rapid way of following research connections through the database. For instance, clicking on "killer cells, natural – ad libitum" would retrieve additional articles that present data tables linking those two concepts.
The "map it" function, at the bottom of the retrieval set (Figure ), provides access to the visualized maps of research findings connections for the current retrieval set. Examples of the concept maps generated by clicking on "map it" from Figure are presented in Figures and . Figure presents a more focused subset map of the research findings relating to the research concept of interest. Blue links highlight findings that the authors of the research report described as statistically significant. The visualization tool permits moving from link to link and expanding the view to include a map of all research relationships reported in the retrieved set of documents (Figures and ). The user can also initiate a new search on a research term or link of interest (e.g., the relationship between survival rate and antioxidants) to retrieve all research papers that have reported this linkage. The iterative nature of the search process, and the ability to explore research connections from both the research report schema and the research concept map, support hypothesis exploration in a way that mimics how many scientists work: by providing a means of exploring a variety of types of connections.
Concept map of research findings linked to neoplasms
Expanded concept map of research findings relationships
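The data behind such a concept map can be sketched as a small undirected graph in which each link records whether the finding was reported as statistically significant. The function names and the sample findings are hypothetical; the sketch shows how a focused view around one concept, and the untested pairs that appear as gaps in the map, could be derived from a list of findings.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

# (concept, concept, reported as statistically significant?)
Finding = Tuple[str, str, bool]
Graph = Dict[str, Dict[str, bool]]


def build_map(findings: Iterable[Finding]) -> Graph:
    """Build an undirected graph; a link is significant if any report says so."""
    graph: Graph = {}
    for a, b, significant in findings:
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        flag = graph[a].get(b, False) or significant
        graph[a][b] = flag
        graph[b][a] = flag
    return graph


def focused_map(graph: Graph, concept: str) -> Dict[str, bool]:
    """Concepts directly linked to one concept, with significance flags."""
    return dict(graph.get(concept, {}))


def untested_pairs(graph: Graph) -> List[Tuple[str, str]]:
    """Concept pairs with no reported link: the gaps in the map."""
    return [(a, b) for a, b in combinations(sorted(graph), 2)
            if b not in graph[a]]


# Invented findings, for illustration only.
findings = [
    ("neoplasms", "caloric restriction", True),
    ("neoplasms", "survival rate", False),
    ("survival rate", "antioxidants", True),
]
concept_map = build_map(findings)
print(focused_map(concept_map, "neoplasms"))
print(untested_pairs(concept_map))
```

Listing the untested pairs makes explicit what the visual map shows implicitly: which relationships among the retrieved concepts have not yet been reported.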