The conduct of clinical and translational research regularly involves the use of a variety of heterogeneous and large-scale data resources. Scalable methods for the integrative analysis of such resources, particularly when attempting to leverage computable domain knowledge in order to generate actionable hypotheses in a high-throughput manner, remain an open area of research. In this report, we describe both a generalizable design pattern for such integrative knowledge-anchored hypothesis discovery operations and our experience in applying that design pattern in the experimental context of a set of driving research questions related to the publicly available Osteoarthritis Initiative data repository. We believe that this ‘test bed’ project and the lessons learned during its execution are both generalizable and representative of common clinical and translational research paradigms.
Clinical and translational research programs regularly produce large amounts of heterogeneous data, information, and knowledge. For example, the NIH-funded Osteoarthritis Initiative (OAI) is a multi-center, longitudinal study that seeks to identify predictive clinical characteristics, environmental exposures, and biomarkers associated with the development and progression of knee osteoarthritis (OA).1 2 A core activity of the OAI program is the collection of demographic, anthropometric, exposure, physical performance, biochemical, genetic, and imaging data from a cohort of over 4000 participants. These data are made publicly available as a scientific resource for the broad biomedical research community. However, due to a number of factors, including inconsistent data representation schemata and a paucity of informatics methods for hypothesis discovery and testing in such multi-dimensional data sets, the ability to reuse such resources is often extremely limited.3 One potential solution to these challenges is the use of knowledge-anchored reasoning methods to discover and explore hypotheses spanning multiple, heterogeneous variables of interest that can be mapped back to their originating data sets. In this report, we describe a body of work conducted as part of the NLM-funded OAMiner project, focusing on two primary goals: (1) the development of a generalizable design pattern for the application of integrative, knowledge-anchored hypothesis discovery methods to heterogeneous data sets; and (2) the application and evaluation of that design pattern, utilizing the OAI data repository as a test bed.
The work described in this report has been conducted within the experimental context of a collaborative effort spanning the Department of Biomedical Informatics and a team of OA investigators at The Ohio State University. This context was selected to demonstrate the applicability of integrative informatics methods for hypothesis discovery in large-scale and heterogeneous data—a scenario frequently encountered in the clinical and translational science domain.
A primary objective of the OAMiner project was to develop a design pattern for knowledge-anchored hypothesis generation in large-scale and heterogeneous public data sets, drawn from current best practices and emerging methods.4–9 This design pattern is illustrated in figure 1. Within this design pattern, we refer to data and knowledge resources collectively as information resources. In addition, we use the term evaluation to refer to the assessment of synthesized knowledge along multiple axes, including validity, usefulness, and novelty. Usefulness and validity are subjective measures of the perceived ability of such knowledge to inform an actionable hypothesis, while novelty is the degree to which a valid and useful hypothesis is unique given the current state of scientific knowledge.
Building upon our design pattern, the second aim for the OAMiner project was to evaluate our approach to knowledge-anchored hypothesis generation in large-scale and heterogeneous public data sets. In the following subsections, we will describe our experience relative to the implementation and evaluation of this overall framework.
A series of semi-structured interviews and focus group discussions were conducted with a convenience sample of OA investigators at The Ohio State University (n=4) in order to identify recurring hypothesis-centric information needs encountered by those individuals. Three recurring needs were identified relative to our experimental context, namely the ability to reason upon linkages between:
Informed by the information needs articulated in phase 1, we pursued two parallel and complementary approaches to feature extraction, focusing on image-based and structured symbolic data, respectively.
In order to fully elucidate the progression of any disease, including OA, it is critical to precisely define variations in structural phenotypes so that one can rigorously select specific physical features to explore both cross-sectionally and longitudinally. Computer-assisted image-derived feature extraction methods are promising in terms of addressing such information needs.8 Therefore, throughout this project, we emphasized a rigorous approach to understanding and describing OA in a discrete manner by analyzing several structural groups in the knee area (eg, cartilage, muscle, or meniscus). Specifically, we targeted the meniscus, bones (femur, tibia, fibula), and quadriceps muscles (vastus intermedius, vastus lateralis, vastus medialis, and rectus femoris). These structures were automatically or semi-automatically detected and segmented, and their characteristics (eg, volume and cross-sectional area) were measured.10–17 These measurements constitute the first-order image-derived features that were used to characterize each of these structures and their components in quantifiable form. The natural variation between participants, regardless of the incidence or progression of the disease, was minimized through statistical normalization techniques.12 Algorithms for the extraction of first-order features also led to the generation of second-order image-derived features (eg, statistical properties such as kurtosis of intensity values), both at a single time point and for longitudinal analysis of temporal change in the characteristics of the disease. An example of such second-order image-derived features is provided in figure 2.
In total, the implementation of first- and second-order feature extraction algorithms relative to the imaging data present in the OAI data repository allows for the creation of a set of structured and computable participant-associated image-based markers that can then be semantically annotated and integrated with other correlative data types.
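To make the distinction between first- and second-order image-derived features concrete, the following is a minimal sketch rather than the project's actual extraction code. It assumes a segmented structure is available as a flat list of voxel intensities plus a boolean mask, and the function names, the voxel volume, and the sample values are all hypothetical. Volume is computed as a voxel count scaled by voxel size, and kurtosis as the fourth standardized moment of the masked intensities.

```python
from typing import Sequence


def region_volume_mm3(mask: Sequence[bool], voxel_volume_mm3: float) -> float:
    """First-order feature: number of voxels inside the mask times voxel volume."""
    return sum(1 for inside in mask if inside) * voxel_volume_mm3


def intensity_kurtosis(intensities: Sequence[float], mask: Sequence[bool]) -> float:
    """Second-order feature: Pearson kurtosis (m4 / m2**2) of masked intensities."""
    values = [v for v, inside in zip(intensities, mask) if inside]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two voxels inside the mask")
    mean = sum(values) / n
    # Second and fourth central moments (population form, dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("intensities are constant; kurtosis is undefined")
    return m4 / (m2 ** 2)


# Hypothetical toy data: six voxels, the last one lies outside the structure.
intensities = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
mask = [True, True, True, True, True, False]

print(region_volume_mm3(mask, voxel_volume_mm3=0.5))
print(intensity_kurtosis(intensities, mask))
```

Computing the same features for each participant visit would yield per-time-point values whose differences could serve as longitudinal markers; the normalization step described above is not shown here.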
Building upon the image-derived features described above, we then focused on the extraction of high priority phenotypic features from the OAI data set. The information resource utilized in this project phase consisted of variables related to case report form questions extracted from the OAI data dictionary. A knowledge engineer with over 10 years of experience in the biomedical informatics domain abstracted the data dictionary entries and curated those concepts into a computationally tractable format. The case report form questions (or labels) and imaging markers generated during this and the preceding image-derived feature analysis methods were then annotated with SNOMED CT concepts using the MetaMap annotation engine18 and UMLS Terminology Services Metathesaurus Browser.19 Post-coordination of concepts was necessary in order to adequately capture the context of many of these variables. For example, the phenotypic variable containing the text ‘Knee pain: in bed’ can be represented using the SNOMED CT concepts for (Knee Pain:30989003) and (Lying in bed:17535004).
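One way to hold such post-coordinated annotations in a computable form is sketched below. This is an illustrative data structure only: the class and field names are hypothetical, the two concept codes are those given in the example above, and the rendered expression simply mirrors the notation used in this report rather than any formal SNOMED CT syntax.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Concept:
    """A single terminology concept (code and human-readable label)."""
    code: str
    label: str


@dataclass(frozen=True)
class AnnotatedVariable:
    """A data dictionary variable annotated with one or more concepts."""
    source_text: str
    concepts: Tuple[Concept, ...]

    @property
    def is_post_coordinated(self) -> bool:
        # More than one concept is needed to capture the variable's context.
        return len(self.concepts) > 1

    def expression(self) -> str:
        # Illustrative rendering only, in the "(Label:code)" style used above.
        return " + ".join(f"({c.label}:{c.code})" for c in self.concepts)


knee_pain_in_bed = AnnotatedVariable(
    source_text="Knee pain: in bed",
    concepts=(
        Concept(code="30989003", label="Knee Pain"),
        Concept(code="17535004", label="Lying in bed"),
    ),
)

print(knee_pain_in_bed.is_post_coordinated)  # True
print(knee_pain_in_bed.expression())  # (Knee Pain:30989003) + (Lying in bed:17535004)
```

Keeping the original variable text alongside its concepts preserves the mapping back to the originating data set, which the design pattern relies on.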
Hypotheses concerning potentially novel relationships between the aforementioned variable types were induced using a component-based biomedical knowledge synthesis platform known as TOKEn (Translational Ontology-anchored Knowledge Discovery Engine).20–23 This platform uses conceptual knowledge engineering techniques to support knowledge discovery in databases,20–24 and in particular, a method known as constructive induction.25 This approach leverages domain-specific knowledge found in publicly available ontologies, together with complementary knowledge extracted from the literature using text mining and machine learning methods, in order to identify knowledge-anchored relationships of interest between sets of variables in a targeted data set.20–23
In order to provide flexibility in terms of the use of TOKEn in our design pattern, components were developed to transform all knowledge sources to an OWL 2.0 representation.26 The OWL standard was chosen due to its widespread adoption and use in the knowledge engineering and semantic web communities. This approach was the basis for the implementation of a computational pipeline including the following steps (figure 3): (1) semantic annotation of heterogeneous data sets; (2) induction of relationships between identified concepts; (3) generation of OWL representations of such data; and (4) use of the TOKEn engine to induce transitive relationships between conceptual entities.
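The final pipeline step, inducing transitive relationships between conceptual entities, can be illustrated with a generic graph traversal. The sketch below is not the TOKEn implementation, whose internals are described in the cited references; it assumes relationships are available as directed (source, target) pairs, and the function name, the hop limit, and the node labels are hypothetical placeholders.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def induce_transitive_links(edges: Iterable[Edge], max_hops: int = 3) -> Set[Edge]:
    """Return (source, target) pairs whose shortest path has 2..max_hops edges."""
    graph: Dict[str, List[str]] = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)

    induced: Set[Edge] = set()
    for start in graph:
        seen = {start}
        queue = deque([(start, 0)])
        # Breadth-first search, so each node is first reached at its shortest depth.
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for neighbor in graph.get(node, []):
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                if depth + 1 >= 2:
                    # Reached only indirectly: a candidate induced relationship.
                    induced.add((start, neighbor))
                queue.append((neighbor, depth + 1))
    return induced


# Hypothetical placeholder concepts, not findings from the OAI data.
asserted = [
    ("image_marker", "intermediate_concept"),
    ("intermediate_concept", "clinical_variable"),
]
print(induce_transitive_links(asserted))  # {('image_marker', 'clinical_variable')}
```

Bounding the path length is one simple way to limit hypothesis complexity; how such a limit should be chosen is a design decision this sketch does not address.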
A structured survey instrument was implemented using the REDCap platform, allowing subject matter experts (SMEs) to view graph-based visualizations of hypotheses generated in phase 4, and to evaluate those hypotheses along the three axes introduced earlier, specifically: validity, usefulness, and novelty. A random subset of the hypotheses generated in phase 4 has been selected for this process, which is actively being evaluated in a user-centric and iterative manner at the time of submission of this manuscript. Preliminary results from this evaluation process have indicated that: (1) hypotheses generated spanning image-derived markers and clinical measurements or performance status indicators are regularly found to be valid, useful, and novel; (2) the complexity of such hypotheses (in terms of the number of concepts, relationships, and information resources involved in their generation and presentation) can affect the ability of SMEs to readily evaluate such constructs; and (3) the use of post-coordinated conceptual entities to comprise such hypotheses remains an open area of investigation, yielding variable results in terms of the validity and usefulness of resulting hypotheses. We intend to report upon the full spectrum of these evaluative results, which extend beyond the scope of this case report, in a subsequent manuscript.
Based upon our experiences in implementing and evaluating the aforementioned design pattern in a prototypical use case, we have identified a number of critical lessons that we believe are applicable to analogous projects, as summarized below.
In this report, we have described a design pattern for the integrative and knowledge-based discovery of hypotheses in a high-throughput manner. We have also presented a number of lessons learned from the application of this design pattern in a prototypical research use case. In doing so, we hope to inform future and analogous research and development efforts, and to catalyze further innovation in this timely and critical area of applied biomedical informatics.
The authors wish to acknowledge the contributions of Dr Peter Embi to the design and evaluation phases of the OAMiner project, as well as Dr David Flanigan for useful discussions and for participating as an SME in the validation of generated hypotheses. We also wish to acknowledge the contributions of Mr Omkar Lele to the design of the TOKEn platform.
Contributors: PROP, RDJ, TMB, and MNG contributed to the conceptualization and planning of the work summarized in this case report. PROP, TBB, AML, SJ, and MNG designed, implemented, and executed the data generation and analysis pipelines as described. PROP, RDJ, TMB, TBB, AML, SJ, and MNG participated in the preparation of the final manuscript, as submitted.
Funding: This work was supported by the National Library of Medicine (R01LM010119, PI: M Gurcan) and the NCRR-funded OSU Center for Clinical and Translational Science (U54RR024384, PI: R Jackson).
Competing interests: None.
Ethics approval: Ethics approval was provided by The Ohio State University Institutional Review Board.
Provenance and peer review: Not commissioned; externally peer reviewed.