J Am Med Inform Assoc. 2012 Nov-Dec; 19(6): 1110–1114.
PMCID: PMC3534452

Applying knowledge-anchored hypothesis discovery methods to advance clinical and translational research: the OAMiner project


The conduct of clinical and translational research regularly involves the use of a variety of heterogeneous and large-scale data resources. Scalable methods for the integrative analysis of such resources, particularly when attempting to leverage computable domain knowledge in order to generate actionable hypotheses in a high-throughput manner, remain an open area of research. In this report, we describe both a generalizable design pattern for such integrative knowledge-anchored hypothesis discovery operations and our experience in applying that design pattern in the experimental context of a set of driving research questions related to the publicly available Osteoarthritis Initiative data repository. We believe that this ‘test bed’ project and the lessons learned during its execution are both generalizable and representative of common clinical and translational research paradigms.

Keywords: Informatics, computing methodologies, knowledge bases, research design, phenotype, biological markers, visualization of data and knowledge, translational research—application of biological knowledge to clinical care, linking the genotype and phenotype, methods for integration of information from disparate sources, knowledge acquisition and knowledge management, skeletal muscle, cartilage, osteoarthritis, data modeling and integration, knowledge representations, data models, imaging informatics, image analysis, CAD, radiology, pathology


Clinical and translational research programs regularly produce large amounts of heterogeneous data, information, and knowledge. For example, the NIH-funded Osteoarthritis Initiative (OAI) is a multi-center, longitudinal study that seeks to identify predictive clinical characteristics, environmental exposures, and biomarkers associated with the development and progression of knee osteoarthritis (OA).1 2 A core activity of the OAI program is the collection of demographic, anthropometric, exposure, physical performance, biochemical, genetic, and imaging data from a cohort of over 4000 participants. These data are made publicly available as a scientific resource for the broad biomedical research community. However, due to a number of factors, including inconsistent data representation schemata and a paucity of informatics methods for hypothesis discovery and testing in such multi-dimensional data sets, the ability to reuse such resources is often extremely limited.3 One potential solution to these challenges is the use of knowledge-anchored reasoning methods to discover and explore hypotheses spanning multiple, heterogeneous variables of interest that can be mapped back to their originating data sets. In this report, we describe a body of work conducted as part of the NLM-funded OAMiner project, focusing on two primary goals: (1) the development of a generalizable design pattern for the application of integrative, knowledge-anchored hypothesis discovery methods to heterogeneous data sets; and (2) the application and evaluation of that design pattern, utilizing the OAI data repository as a test bed.

Case description

The work described in this report has been conducted within the experimental context of a collaborative effort spanning the Department of Biomedical Informatics and a team of OA investigators at The Ohio State University. This context was selected to demonstrate the applicability of integrative informatics methods for hypothesis discovery in large-scale and heterogeneous data—a scenario frequently encountered in the clinical and translational science domain.

Methods of implementation

A primary objective of the OAMiner project was to develop a design pattern for knowledge-anchored hypothesis generation in large-scale and heterogeneous public data sets, drawn from current best practices and emerging methods.4–9 This design pattern is illustrated in figure 1. Of note relative to our design pattern is the collective reference to data and knowledge resources as information resources. In addition, as part of our pattern, we use the term evaluation to refer to the assessment of synthesized knowledge relative to multiple axes, including validity, usefulness, and novelty. Usefulness and validity are subjective measures of the perceived ability of such knowledge to inform an actionable hypothesis, while novelty is the degree to which a valid and useful hypothesis is unique given the current state of scientific knowledge.

Figure 1
Overview of project-specific design pattern, illustrating information resources that can be leveraged, as well as a five-phase process incorporating: (1) information needs assessment; (2) extraction of structured features from targeted information resources; ...

Building upon our design pattern, the second aim for the OAMiner project was to evaluate our approach to knowledge-anchored hypothesis generation in large-scale and heterogeneous public data sets. In the following subsections, we will describe our experience relative to the implementation and evaluation of this overall framework.

Phase 1: information needs identification

A series of semi-structured interviews and focus group discussions were conducted with a convenience sample of OA investigators at The Ohio State University (n=4) in order to identify recurring hypothesis-centric information needs encountered by those individuals. Three recurring needs were identified relative to our experimental context, namely the ability to reason upon linkages between:

  • Image derived markers and clinical measurements;
  • Image derived markers and patient reported outcomes; and
  • Image derived markers, clinical measurements, and functional status indicators.

Phase 2: feature extraction

Informed by the information needs articulated in phase 1, we pursued two parallel and complementary approaches to feature extraction, focusing on image-based and structured symbolic data, respectively.

Feature extraction from unstructured data: image-derived feature generation

In order to fully elucidate the progression of any disease, including OA, it is critical to precisely define variations in structural phenotypes so that one can rigorously select specific physical features to explore both cross-sectionally and longitudinally. Computer-assisted image-derived feature extraction methods are promising in terms of addressing such information needs.8 Therefore, throughout this project, we emphasized a rigorous approach to understanding and describing OA in a discrete manner by analyzing several structural groups in the knee area (eg, cartilage, muscle, or meniscus). Specifically, we targeted the meniscus, bones (femur, tibia, fibula), and quadriceps muscles (vastus intermedius, vastus lateralis, vastus medialis, and rectus femoris). These structures were automatically or semi-automatically detected and segmented, and their characteristics (eg, volume, cross-sectional area) were measured.10–17 These measurements constitute the first-order image-derived features that were used to characterize each of these structures and their components in quantifiable form. The natural variation between participants, regardless of the incidence or progression of the disease, was minimized through statistical normalization techniques.12 Algorithms for the extraction of first-order features also led to the generation of second-order image-derived features (eg, statistical properties such as kurtosis of intensity values), both at a single time point and for longitudinal analysis of temporal change in the characteristics of the disease. An example of such second-order image-derived features is provided in figure 2.
In total, the implementation of first- and second-order feature extraction algorithms relative to the imaging data present in the OAI data repository allows for the creation of a set of structured and computable participant-associated image-based markers that can then be semantically annotated and integrated with other correlative data types.
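As a rough illustration of what a second-order image-derived feature looks like in computable form, the following minimal sketch derives skewness and kurtosis from the intensity values inside a segmented region. It is not the project's implementation: the function name `second_order_features`, the use of population moment estimators, and the sample intensity values are all assumptions made for the example.

```python
import math
from typing import Dict, Sequence


def second_order_features(intensities: Sequence[float]) -> Dict[str, float]:
    """Summarize the intensity distribution of one segmented structure.

    Uses population (biased) moment estimators; kurtosis is reported as
    excess kurtosis, so a Gaussian distribution scores approximately 0.
    """
    n = len(intensities)
    if n < 2:
        raise ValueError("need at least two intensity values")
    mean = sum(intensities) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in intensities) / n
    m3 = sum((x - mean) ** 3 for x in intensities) / n
    m4 = sum((x - mean) ** 4 for x in intensities) / n
    if m2 == 0:
        raise ValueError("constant intensities: skewness and kurtosis are undefined")
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Hypothetical voxel intensities inside a segmented meniscus mask.
features = second_order_features([1, 2, 3, 4, 5])
```

Computing the same summary for each participant and time point would yield a structured, per-participant feature table of the kind that can then be annotated and integrated with other data types.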

Figure 2
Output of meniscus segmentation (outlined), with corresponding histogram and Gaussian curve fitting. Higher order statistical measures derived from the histogram, such as skewness and kurtosis, are examples of second-order image-derived features.

Feature extraction from structured and semi-structured data: knowledge discovery in databases

Building upon the image-derived features described above, we then focused on the extraction of high priority phenotypic features from the OAI data set. The information resource utilized in this project phase consisted of variables related to case report form questions extracted from the OAI data dictionary. A knowledge engineer with over 10 years of experience in the biomedical informatics domain abstracted the data dictionary entries and curated those concepts into a computationally tractable format. The case report form questions (or labels) and the imaging markers generated by the preceding image-derived feature analysis were then annotated with SNOMED CT concepts using the MetaMap annotation engine18 and UMLS Terminology Services Metathesaurus Browser.19 Post-coordination of concepts was necessary in order to adequately capture the context of many of these variables. For example, the phenotypic variable containing the text ‘Knee pain: in bed’ can be represented using the SNOMED CT concepts for (Knee Pain:30989003) and (Lying in bed:17535004).
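The post-coordination example above can be sketched as a small data structure in which one variable carries an ordered set of concepts. The class names `Concept` and `AnnotatedVariable`, and the '+'-joined rendering, are hypothetical conveniences for illustration; they are not the storage format used by the project.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Concept:
    code: str   # SNOMED CT concept identifier
    label: str  # human-readable name


@dataclass(frozen=True)
class AnnotatedVariable:
    source_text: str               # label taken from the data dictionary
    concepts: Tuple[Concept, ...]  # post-coordinated concepts, in order

    def expression(self) -> str:
        """Render the concepts as a '+'-joined string (illustrative format only)."""
        return " + ".join(f"{c.code} |{c.label}|" for c in self.concepts)


knee_pain_in_bed = AnnotatedVariable(
    source_text="Knee pain: in bed",
    concepts=(
        Concept("30989003", "Knee Pain"),
        Concept("17535004", "Lying in bed"),
    ),
)
print(knee_pain_in_bed.expression())
# prints: 30989003 |Knee Pain| + 17535004 |Lying in bed|
```

Keeping the original label alongside its concepts preserves the mapping back to the originating data set, which the later phases depend on.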

Phases 3–4: feature aggregation and knowledge synthesis

Hypotheses concerning potentially novel relationships between the aforementioned variable types were induced using a component-based biomedical knowledge synthesis platform known as TOKEn (Translational Ontology-anchored Knowledge Discovery Engine).20–23 This platform uses conceptual knowledge engineering techniques to support knowledge discovery in databases,20–24 and in particular, a method known as constructive induction.25 This approach leverages domain-specific knowledge found in publicly available ontologies, together with complementary knowledge extracted from literature using text mining and machine learning methods, in order to identify knowledge-anchored relationships of interest between sets of variables in a targeted data set.20–23

In order to provide flexibility in terms of the use of TOKEn in our design pattern, components were developed to transform all knowledge sources to an OWL 2.0 representation.26 The OWL standard was chosen due to its widespread adoption and use in the knowledge engineering and semantic web communities. This approach was the basis for the implementation of a computational pipeline including the following steps (figure 3): (1) semantic annotation of heterogeneous data sets; (2) induction of relationships between identified concepts; (3) generation of OWL representations of such data; and (4) use of the TOKEn engine to induce transitive relationships between conceptual entities.
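Step (4) of the pipeline, the induction of transitive relationships, can be pictured as a search for a chain of relations that connects two annotated concepts. The sketch below uses a plain breadth-first search over a toy graph; it is not the TOKEn algorithm, and the function `find_path`, the `max_hops` limit, and every concept and relation name in the example graph are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Each edge is (relation, target concept); all names below are hypothetical.
Graph = Dict[str, List[Tuple[str, str]]]


def find_path(graph: Graph, start: str, goal: str, max_hops: int = 4) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of relations from start to goal."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        hops = len(path) // 2  # path alternates concept, relation, concept, ...
        if hops >= max_hops:
            continue
        for relation, target in graph.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [relation, target]))
    return None


graph: Graph = {
    "vastus intermedius cross-sectional area": [("measure_of", "quadriceps muscle")],
    "quadriceps muscle": [("acts_on", "knee joint")],
    "knee joint": [("site_of", "knee pain")],
}
path = find_path(graph, "vastus intermedius cross-sectional area", "knee pain")
print(" -> ".join(path))
```

Each returned path is a candidate hypothesis linking an image-derived marker to a clinical variable. Bounding the number of hops is one simple way to limit the complexity of what is later shown to reviewers.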

Figure 3
System design overview of the OAMiner hypothesis generation pipeline. The knowledge source component pipeline exists to extract computable information from unstructured and semi-structured knowledge using natural language processing (NLP) techniques and ...

Phase 5: evaluation

A structured survey instrument was implemented using the REDCap platform, allowing subject matter experts (SMEs) to view graph-based visualizations of hypotheses generated in phase 4, and to evaluate those hypotheses along the three axes introduced earlier: validity, usefulness, and novelty. A random subset of the hypotheses generated in phase 4 has been selected for this process, and that subset is actively being evaluated in a user-centric and iterative manner at the time of submission of this manuscript. Preliminary results from this evaluation process have indicated that: (1) hypotheses generated spanning image-derived markers and clinical measurements or performance status indicators are regularly found to be valid, useful, and novel; (2) the complexity of such hypotheses (in terms of the number of concepts, relationships, and information resources involved in their generation and presentation) can affect the ability of SMEs to readily evaluate such constructs; and (3) the use of post-coordinated conceptual entities to comprise such hypotheses remains an open area of investigation, yielding variable results in terms of the validity and usefulness of resulting hypotheses. We intend to report upon the full spectrum of these evaluative results, which extend beyond the scope of this case report, in a subsequent manuscript.
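To make the three-axis evaluation concrete, the following minimal sketch averages SME responses for a single hypothesis. The 1 to 5 rating scale, the function `summarize_ratings`, and the sample responses are assumptions for illustration only; the actual survey instrument and its scoring are not described in this report.

```python
from statistics import mean
from typing import Dict, List

AXES = ("validity", "usefulness", "novelty")


def summarize_ratings(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each evaluation axis across SME responses for one hypothesis."""
    if not ratings:
        raise ValueError("no ratings supplied")
    return {axis: mean(r[axis] for r in ratings) for axis in AXES}


# Hypothetical responses from two SMEs on an assumed 1-5 scale.
summary = summarize_ratings([
    {"validity": 4, "usefulness": 3, "novelty": 2},
    {"validity": 5, "usefulness": 3, "novelty": 2},
])
```

Reporting the axes separately, rather than as one combined score, keeps a valid but unsurprising hypothesis distinguishable from a novel but doubtful one.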


Based upon our experiences in implementing and evaluating the aforementioned design pattern in a prototypical use case, we have identified a number of critical lessons that we believe are applicable to analogous projects, as summarized below.

  • Each disease has its own detection, diagnosis, and treatment regimen, and in many cases imaging is a critical component of these steps. However, scalable methods that allow such imaging data to be integrated with other, heterogeneous data types have not been well developed. In order to apply the approaches we have described in this report to other diseases, it will be very important to understand: (1) the type of imaging modalities (eg, MRI, CT, histopathology); and (2) the imaging stage (eg, detection, radiation therapy, treatment monitoring) and the imaging-informed key decision factors (eg, detection of tumors, accurately quantifying the size and morphology of structures in longitudinal studies). Imaging should inform and extend the current knowledge with all its capabilities (accuracy, consistency, and measurement of phenomena of interest (eg, texture characteristics)). Combining existing consistent and objective measurement tools with newly developed ones opens up possibilities for imaging-based biomarker generation and validation.
  • The lack of standards surrounding the annotation of public research data sets is a barrier to using existing knowledge sources consistent with the design patterns described in this report. Without conceptual metadata, usage of publicly available data is limited to traditional syntactic analysis. Annotation of the data is required in order to process and infer knowledge over the data collection at a conceptual level. However, many logistical issues concerning conceptual annotation of generic data sets have not been standardized. These issues include the mapping and storage of annotations, bidirectional query support between the data and its metadata, and the inaccuracy of automated named-entity-recognition software products. Because of these issues, we encountered many barriers while trying to reuse existing knowledge collections in OAMiner.
  • There are a number of sources of potential bias with respect to using SMEs when evaluating the validity of integrative hypotheses. SMEs typically have very deep knowledge in a single relatively narrow domain, and in essence their knowledge is ‘siloed.’ The hypotheses generated using TOKEn will frequently span these silos of knowledge, while SMEs will tend to be inherently anchored in their domains. When evaluating the concept maps linking the phenotypic markers to the image-derived biomarkers, an SME may dismiss certain hypotheses because the pathway linking them goes outside of their area of expertise. In addition, an SME's established mental model may hold that the presented phenotypic marker and image-derived biomarker are unrelated, even when a plausible pathway is presented. In this situation, it can be difficult for an SME to set aside the anchoring in their reasoning in order to consider other possible outcomes.


In this report, we have described a design pattern for the integrative and knowledge-based discovery of hypotheses in a high-throughput manner. We have also presented a number of lessons learned from the application of this design pattern in a prototypical research use case. In doing so, we hope to inform future and analogous research and development efforts, and to catalyze further innovation in this timely and critical area of applied biomedical informatics.


The authors wish to acknowledge the contributions of Dr Peter Embi to the design and evaluation phases of the OAMiner project, as well as Dr David Flanigan for useful discussions and for participating as an SME in the validation of generated hypotheses. We also wish to acknowledge the contributions of Mr Omkar Lele to the design of the TOKEn platform.


Contributed by

Contributors: PROP, RDJ, TMB, and MNG contributed to the conceptualization and planning of the work summarized in this case report. PROP, TBB, AML, SJ, and MNG designed, implemented, and executed the data generation and analysis pipelines as described. PROP, RDJ, TMB, TBB, AML, SJ, and MNG participated in the preparation of the final manuscript, as submitted.

Funding: This work was supported by the National Library of Medicine (R01LM010119, PI: M Gurcan) and the NCRR-funded OSU Center for Clinical and Translational Science (U54RR024384, PI: R Jackson).

Competing interests: None.

Ethics approval: Ethics approval was provided by The Ohio State University Institutional Review Board.

Provenance and peer review: Not commissioned; externally peer reviewed.


1. Fawaz-Estrup F. The osteoarthritis initiative: an overview. Med Health R I 2004;87:169–71.
2. Lester G. Clinical research in OA—the NIH Osteoarthritis Initiative. J Musculoskelet Neuronal Interact 2008;8:313–14.
3. Arzberger P, Schroeder P, Beaulieu A, et al. Promoting access to public research data for scientific, economic, and social development. Data Sci J 2004;3:135–52.
4. Ruttenberg A, Clark T, Bug W, et al. Advancing translational research with the Semantic Web. BMC Bioinformatics 2007;8(Suppl 3):S2.
5. Payne PR, Embi PJ, Sen CK. Translational informatics: enabling high-throughput research paradigms. Physiol Genomics 2009;39:131–40.
6. Maojo V, García-Remesal M, Billhardt H, et al. Designing new methodologies for integrating biomedical information in clinical trials. Methods Inf Med 2006;45:180–5.
7. Faustino RS, Chiriac A, Terzic A. Bioinformatic primer for clinical and translational science. Clin Transl Sci 2008;1:174–80.
8. Chung TK, Kukafka R, Johnson SB. Reengineering clinical research with informatics. J Investig Med 2006;54:327–33.
9. Butte AJ. Medicine. The ultimate model organism. Science 2008;320:325–7.
10. Ababneh S. An Automated Content-Based Segmentation Framework: Application to MR Images of Knee for Osteoarthritis Research. 2010 IEEE International Conference on Electro/Information Technology (EIT 2010). IEEE, Normal, IL; 2010.
11. Ababneh SY, Prescott JW, Gurcan MN. Automatic graph-cut based segmentation of bones from knee magnetic resonance images for osteoarthritis research. Med Image Anal 2011;15:438–48.
12. Prescott J, Best TM, Haq F, et al. Vastus intermedius cross-sectional area is associated with radiographic severity of knee osteoarthritis. 58th American College of Sports Medicine (ACSM) Annual Meeting; May 31-June 4. Denver, CO, 2011.
13. Prescott JW, Best TM, Swanson MS, et al. Anatomically anchored template-based level set segmentation: application to quadriceps muscles in MR images from the Osteoarthritis Initiative. J Digit Imaging 2011;24:28–43.
14. Prescott JW, Pennell M, Best TM, et al. An automated method to segment the femur for osteoarthritis research. Conf Proc IEEE Eng Med Biol Soc 2009;2009:6364–7.
15. Prescott JW, Priddy M, Best TM, et al. An automated method to detect interstitial adipose tissue in thigh muscles for patients with osteoarthritis. Conf Proc IEEE Eng Med Biol Soc 2009;2009:6360–3.
16. Prescott JW, Swanson MS, Powell K, et al. Template-Based Level Set Segmentation using Anatomical Information Application to Quadriceps Muscles in MR Images from the Osteoarthritis Initiative. 2009 24th International Symposium on Computer and Information Sciences. IEEE, Guzelyurt, Cyprus; 2009:24–8, 746.
17. Swanson MS, Prescott JW, Best TM, et al. Semi-automated segmentation to assess the lateral meniscus in normal and osteoarthritic knees. Osteoarthritis Cartilage 2010;18:344–53.
18. NLM MetaMap Portal. 2011. (accessed 9 Aug 2011).
19. NLM UMLS Terminology Services. 2011. (accessed 9 Aug 2011).
20. Payne PR, Borlawsky TB, Kwok A, et al. Ontology-anchored Approaches to Conceptual Knowledge Discovery in a Multi-dimensional Research Data Repository. 2008 AMIA Translational Bioinformatics Summit. San Francisco: American Medical Informatics Association, 2008.
21. Payne PR, Borlawsky TB, Kwok A, et al. Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. AMIA Annu Symp Proc 2008:566–70.
22. Payne PR, Borlawsky TB, Rice R, et al. Evaluating the Impact of Conceptual Knowledge Engineering on the Design and Usability of a Clinical and Translational Science Collaboration Portal. AMIA Clinical Research Informatics Summit. San Francisco, CA: American Medical Informatics Association, 2010.
23. Payne PR, Huang K, Keen-Circle K, et al. Multi-dimensional Discovery of Biomarker and Phenotype Complexes. AMIA Translational Bioinformatics Summit. San Francisco, CA: American Medical Informatics Association, 2010.
24. Payne PR, Mendonca EA, Johnson SB, et al. Conceptual knowledge acquisition in biomedicine: a methodological review. J Biomed Inform 2007;40:582–602.
25. Bloedorn E, Michalski RS. Data-driven constructive induction. IEEE Intelligent Systems and their Applications 1998;13:30–7.
26. W3C Web Ontology Language (OWL). 2011. (accessed 9 Aug 2011).
