Clinical research eligibility criteria specify the medical, demographic, or social characteristics of eligible clinical research volunteers. Their free-text format remains a significant barrier to computer-based decision support for electronic patient eligibility determination,1 clinical evidence application,2 and clinical research knowledge management.3 Knowledge representation can formalize information in a domain to support automated reasoning; consequently, many knowledge representations for eligibility criteria have been proposed,2 with a recent focus on specifying the common data elements in eligibility criteria (eg, the agreement on standardized protocol inclusion requirements for eligibility, ASPIRE) or the syntactic structures in eligibility criteria (eg, the eligibility rule grammar and ontology, ERGO).4 However, the considerable variation among these knowledge representations makes it difficult to achieve semantic interoperability among the systems that use them. A shared knowledge representation for clinical research eligibility criteria that different decision support systems can use is therefore needed, yet there is no consensus on the key requirements for such a representation.
As text remains the primary knowledge source for humans, an important requirement for a knowledge representation, and a key natural language processing (NLP) challenge for using the existing knowledge representations, is linking the syntactic structures or semantic arguments in text to corresponding knowledge representations. For example, a knowledge representation of the criterion ‘diagnosis of osteoarthritis of the knee for at least 6 months’ involves the extraction of the sentence constituents such as ‘osteoarthritis’, ‘knee’, and ‘for at least 6 months’ and the annotation of a medical condition (‘osteoarthritis’) with its body location being ‘the knee’ and its temporal duration being ‘≥6 months’. Domain experts are often required to perform such annotations manually or semi-automatically. The recent ERGO annotation process provides NLP support,4 but it requires manual selection from templates defined for simple, complex and comparison criteria, as well as manual mapping from criteria sentence constituents to ERGO annotation frames (eg, ‘second expression’ or ‘statement connector’). These frames do not naturally match the semantic roles of the corresponding sentence constituents in eligibility criteria, where a semantic role is the name of a semantic argument or the relation between a syntactic constituent and a predicate. Examples of semantic arguments for English include locative, temporal, and manner. The recognition and annotation of semantic arguments is required for answering ‘who’, ‘when’, ‘what’, ‘where’, ‘why’, and other questions in information extraction, question answering, summarization, and all NLP tasks that require semantic interpretation.5 The above example criterion ‘diagnosis of osteoarthritis of the knee for at least 6 months’ can be decomposed into three semantic arguments: ‘diagnosis of osteoarthritis’, ‘of the knee’, and ‘for at least 6 months’. Their corresponding semantic roles are medical condition, body location, and temporal constraint, respectively.
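The decomposition of the example criterion can be written down as a small data structure. The following is only an illustrative sketch: the names `SemanticArgument` and `annotate` are hypothetical and are not part of ERGO or of any existing tool, and the constituents are supplied by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticArgument:
    """One sentence constituent paired with its semantic role label."""
    text: str
    role: str


def annotate(criterion, labelled):
    """Pair hand-labelled constituents with roles (hypothetical helper).

    Each constituent must occur in the criterion, in the given order;
    otherwise a ValueError is raised.
    """
    arguments = []
    cursor = 0
    for text, role in labelled:
        position = criterion.find(text, cursor)
        if position < 0:
            raise ValueError(f"constituent not found in order: {text!r}")
        cursor = position + len(text)
        arguments.append(SemanticArgument(text, role))
    return arguments


criterion = "diagnosis of osteoarthritis of the knee for at least 6 months"
arguments = annotate(criterion, [
    ("diagnosis of osteoarthritis", "medical condition"),
    ("of the knee", "body location"),
    ("for at least 6 months", "temporal constraint"),
])

for argument in arguments:
    print(f"{argument.role}: {argument.text}")
```

In this sketch the three constituents cover the whole criterion, so each part of the sentence carries exactly one role label.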
Two properties of eligibility criteria further complicate the NLP challenges: frequent recursive structures, in which a sentence consists of multiple phrases that are themselves composed of phrases or words, and hierarchical syntax, in which multiple levels of syntactic grammar rules apply within one sentence. The criterion ‘chronic administration (defined as more than 14 days) of systemic high dose immunosuppressant drugs during a period starting from 6 months prior to administration of the vaccine and ending at study conclusion’ is such an example. Its hierarchical syntax is illustrated in the accompanying figure. At the top level, the sentence consists of two semantic arguments: medication event and temporal constraint. Each semantic argument has its own information structure; therefore, at the second level, the medication event can be decomposed into three semantic arguments (temporal modifier, dosage, and drug name or description), while the temporal constraint is decomposed into duration, temporal relation, and anchor. These concepts can be further decomposed into semantic arguments with finer granularity at lower levels. To the best of our knowledge, current NLP methods cannot parse and encode free-text criteria using the existing knowledge representations at the fine granularity shown in the figure, yet this ability is much desired to enable faceted search among clinical research eligibility criteria. Therefore, there is a great need to bridge this gap with a semantic knowledge representation for clinical research eligibility criteria that can facilitate its symbiotic interactions with NLP tools.
The hierarchical syntax of an example criterion. Semantic role labels are in bold text. The corresponding sentence constituents are in italic text.
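The two levels described above lend themselves to a recursive tree. The sketch below is a hedged illustration only: `RoleNode` and `roles_by_depth` are hypothetical names, the root label ‘criterion’ is an assumption, and second-level nodes carry role labels but no text because the text does not state which constituent fills each of those roles.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoleNode:
    """A semantic role label, optionally with its constituent and sub-roles."""
    role: str
    text: Optional[str] = None
    children: list["RoleNode"] = field(default_factory=list)


def roles_by_depth(node, depth=0, levels=None):
    """Collect role labels per tree depth by recursive traversal."""
    if levels is None:
        levels = {}
    levels.setdefault(depth, []).append(node.role)
    for child in node.children:
        roles_by_depth(child, depth + 1, levels)
    return levels


medication_text = ("chronic administration (defined as more than 14 days) "
                   "of systemic high dose immunosuppressant drugs")
temporal_text = ("during a period starting from 6 months prior to "
                 "administration of the vaccine and ending at study conclusion")

tree = RoleNode("criterion", f"{medication_text} {temporal_text}", [
    RoleNode("medication event", medication_text, [
        RoleNode("temporal modifier"),
        RoleNode("dosage"),
        RoleNode("drug name or description"),
    ]),
    RoleNode("temporal constraint", temporal_text, [
        RoleNode("duration"),
        RoleNode("temporal relation"),
        RoleNode("anchor"),
    ]),
])

levels = roles_by_depth(tree)
for depth in sorted(levels):
    print(depth, levels[depth])
```

Because every node has the same shape, the same traversal applies at any depth, which is what a recursive labelling procedure would need.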
Information extraction has been a central research area in NLP, especially in biomedical language processing.6 A large body of work has highlighted the difficulties that arise when target knowledge representations differ greatly from the sublanguage knowledge and information structure in source text.7 One can reduce the effort to extract information from text by adopting a knowledge representation that naturally aligns with the information structure in text. A key step in achieving this alignment is to induce the semantic knowledge representation directly from the text. For example, researchers in the biomedical domain have considered methods to facilitate semantic interoperability across different text processing systems by developing the Canon model.8 Similarly, we are motivated to create a semantic representation for eligibility criteria that can serve as a shareable conceptual schema for clinical research eligibility criteria. With such a representation, we can approximate the results of an ideal NLP system by enabling progressive semistructured information extraction from clinical research eligibility criteria through automatic, recursive semantic role labelling. Semantic role labelling is also referred to as semantic argument identification and classification.
Previously, we analyzed the terms in clinical research eligibility criteria and discovered that 20 semantic types from the Unified Medical Language System (UMLS)9 cover over 80% of the terms in eligibility criteria,10 which leads to our hypothesis that the UMLS is a good semantic knowledge source for a semantic representation for eligibility criteria.11 We also hypothesize that eligibility criteria contain a manageable number of semantic patterns, or combinations of the UMLS semantic types. Moreover, syntactic parsing has been used successfully to extract semantic patterns in different domains.12 Therefore, we further hypothesize that a syntactic parser integrated with a pattern-mining algorithm can facilitate efficient semantic pattern extraction in clinical research eligibility criteria.
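To make the notion of a semantic pattern concrete, the sketch below counts how often each combination of UMLS semantic types occurs and keeps those that meet a minimum support. It is a toy illustration under stated assumptions, not the pattern-mining algorithm used in this work: `frequent_patterns` is a hypothetical name, simple frequency counting stands in for pattern mining, and the tagged input is invented for the example rather than drawn from real criteria.

```python
from collections import Counter


def frequent_patterns(tagged_criteria, min_support):
    """Return semantic type sequences seen at least min_support times."""
    counts = Counter(tuple(types) for types in tagged_criteria)
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


# Invented toy input: each inner list stands for one criterion that has
# already been mapped to UMLS semantic types (assumption, not real data).
toy_tagged_criteria = [
    ["Disease or Syndrome", "Temporal Concept"],
    ["Disease or Syndrome", "Temporal Concept"],
    ["Pharmacologic Substance", "Temporal Concept"],
    ["Disease or Syndrome", "Body Part, Organ, or Organ Component",
     "Temporal Concept"],
]

patterns = frequent_patterns(toy_tagged_criteria, min_support=2)
for pattern, count in patterns.items():
    print(count, " + ".join(pattern))
```

A real pipeline would obtain the type sequences from a syntactic parser and a UMLS-based tagger; here they are hard-coded to keep the example self-contained.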
In the rest of this paper, we present an integrated semantic processing framework, called eligibility criteria extraction and representation (EliXR), for inducing natural semantic role labels from text. We contribute a novel semantic network that defines the common semantic role labels for clinical research eligibility criteria and their frequent semantic relations. We also demonstrate the feasibility of using these semantic role labels to annotate eligibility criteria with nearly perfect interrater reliability and discuss the potential of using the EliXR analysis pipeline to facilitate semistructured information extraction from free-text eligibility criteria.