Clustering is an unsupervised learning
technique that does not require a human-labeled training set but instead identifies similarities between instances, whereas classification is a supervised machine learning
approach that must be trained on manually labeled examples.27 In this paper, we present an effective approach to the dynamic categorization of clinical research eligibility criteria by integrating hierarchical clustering and classification algorithms through a shared semantic feature representation based on the UMLS semantic types. Our method demonstrates the value of using the UMLS semantic types for feature representation. To improve machine learning efficiency, various approaches have been developed to automate training data generation.28–30 Our semantic annotator automatically generates features based on the UMLS semantic types and substantially reduces the learning dimension compared with the traditional "bag of words" method. Prior studies manually defined categories for clinical eligibility criteria.4–5 Our method reduces the human effort required for category development and contributes a set of fine-grained semantic categories for clinical eligibility criteria. Moreover, previously proposed categories for clinical sentences were often task-dependent, such as a study that assigned Introduction, Methods, Results, or Discussion categories to sentences in MEDLINE abstracts.31 To our knowledge, our research is the first to automatically categorize clinical research eligibility criteria based on the semantic similarities among the criteria. Of the 5 classification algorithms, the best performing classifier is the Decision Tree J48, which achieves an overall F1-score of 86.9%. In a different clinical domain, McKnight and Srinivasan32
reported a method incorporating sentence position and "bag of words" as learning features and achieved F1-scores ranging from 52% to 79% across categories. Compared with these existing methods, our method shows the potential to significantly improve sentence classification accuracy.
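To make the feature-representation argument concrete, the sketch below contrasts a "bag of words" representation with a semantic-type representation. The concept-to-type dictionary here is a tiny hypothetical stand-in for a UMLS lookup (a real annotator would query the full UMLS Metathesaurus); the point is only that the feature space shrinks from the full vocabulary to the small inventory of semantic types.

```python
from collections import Counter

# Hypothetical, tiny stand-in for a UMLS concept-to-semantic-type lookup;
# a real annotator would consult the full UMLS Metathesaurus.
SEMANTIC_TYPE_OF = {
    "alkaline phosphatase": "Laboratory Procedure",
    "metformin": "Pharmacologic Substance",
    "diabetes": "Disease or Syndrome",
    "cholecystectomy": "Therapeutic or Preventive Procedure",
    "pregnancy": "Organism Function",
}

def semantic_type_features(criterion: str) -> Counter:
    """Map a criterion sentence to counts over UMLS semantic types.

    The feature space is the small set of semantic types rather than the
    full vocabulary, which is what shrinks the learning dimension."""
    text = criterion.lower()
    feats = Counter()
    for term, sem_type in SEMANTIC_TYPE_OF.items():
        if term in text:
            feats[sem_type] += 1
    return feats

def bag_of_words_features(criterion: str) -> Counter:
    """Baseline 'bag of words': one dimension per distinct token."""
    return Counter(criterion.lower().split())
```

For a criterion like "History of diabetes treated with metformin", the bag-of-words vector has one dimension per token, while the semantic representation collapses the recognized concepts into two type dimensions.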
Our method for dynamic categorization of criteria sentences is inspired by and extends notable related work on dynamic categorization of documents: the DynaCat system developed by Pratt.25 DynaCat utilized the UMLS for knowledge-based labeling of query terms and classification of PubMed documents. All query terms were automatically encoded with MeSH terms, but document categories and classification rules were manually specified. We extended DynaCat by using hierarchical clustering to automatically induce semantic categories for the objects to be categorized and by using a machine-learning approach to train the classifier, an improvement over manually defined rule-based classifiers. By using MeSH terms, DynaCat achieved standards-based query term annotation but did not reduce the feature space. As an extension, we used the UMLS semantic types to annotate eligibility criteria concepts and substantially reduced the feature dimension for machine learning-based classification. Furthermore, DynaCat performed categorization at the document level; in contrast, our method supports categorization at the sentence level.
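The category-induction step can be illustrated with a minimal single-linkage agglomerative clustering over criteria represented as sets of semantic types. This is a sketch under assumptions: the paper does not specify the linkage or distance used, and a Jaccard distance on semantic-type sets is chosen here only for simplicity.

```python
# Minimal single-linkage agglomerative clustering sketch. The linkage
# and distance are illustrative assumptions, not the paper's exact setup.
def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Distance between two criteria represented as sets of semantic types."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def single_linkage(items, k):
    """Greedily merge the closest pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        best = None  # (distance, cluster_index_i, cluster_index_j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(items[a], items[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge j into i
    return clusters
```

On four toy criteria, two drug-related and two lab-related, cutting the hierarchy at two clusters groups the drug criteria together and the lab criteria together, which is the kind of induced grouping a human can then name as a semantic category.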
We also compared our semi-automatically induced criteria categories with existing clinical data or clinical query categories provided by various standardization organizations, such as Health Level Seven (HL7),33 the MURDOCK study group,34 and the BRIDG group.35 A significant portion of our categories overlaps with these manually defined standards. For instance, the Continuity of Care Document (CCD) defined by HL7 contains 17 clinical data types,33 such as Problems, Procedures, and Medications, which are also included in our 27 categories. The CCD data elements that do not intersect with our categories include Header for message formatting and Payer for payment, which are not semantically interesting. The MURDOCK study group proposed 11 study variables34 for integrating clinical data. Several of them can be aligned with our categories, such as Demographics, Physical examinations, and Laboratory test results. The Biomedical Research Integrated Domain Group (BRIDG) Model was developed by a group of domain experts for clinical data modeling. The BRIDG model defines 17 eligibility criterion attributes, 16 of which we were able to align with our induced semantic classes. We also identified 8 classes that were not specified by the BRIDG model. Some highly prevalent criteria categories that we identified were not defined in BRIDG, such as Therapy or Surgery
, which has a 48% prevalence among eligibility criteria published on ClinicalTrials.gov. These results imply that our criteria categories are comparable to those developed by clinical experts and include categories that clinical experts may miss.
In this study, some classification errors were caused by noise in the UMLS. For example, in the criterion "Alkaline Phosphatase < 2.5 times ULN," the term Alkaline phosphatase carries the UMLS semantic type Pharmacologic Substance; therefore, this criterion was classified as Pharma Substance or Drug. However, the criterion specifies the range of a lab test variable and should be classified as Lab Test Results. Similarly, the criterion History of Cholecystectomy was mapped to the general semantic type Finding, but a human reviewer considered this criterion a past surgery, whose category should be Therapy and Procedure.
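One way such errors could be mitigated is a contextual post-processing rule. The heuristic below is a hypothetical sketch, not part of the published method: when a criterion pairs a concept with a numeric comparison, it overrides a drug prediction in favor of Lab Test Results.

```python
import re

# Hypothetical post-processing heuristic (not part of the published method):
# a numeric comparison (e.g., "< 2.5 times ULN") suggests a lab-value range,
# even when the UMLS maps the concept to Pharmacologic Substance.
COMPARISON = re.compile(r"[<>=]|\b(times|x)\s+(the\s+)?ULN\b", re.IGNORECASE)

def adjust_category(criterion: str, predicted: str) -> str:
    """Override a drug prediction when a comparison expression is present."""
    if predicted == "Pharma Substance or Drug" and COMPARISON.search(criterion):
        return "Lab Test Results"
    return predicted
```

Such rules are brittle, but they show how shallow context can compensate for coarse or noisy semantic-type assignments in the UMLS.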
We can improve our current methodology in several ways. We identified two open research questions for classifying clinical sentences. The first is to develop better machine learning algorithms for imbalanced training data. As we demonstrated in section 3.2, different categories achieved varying degrees of accuracy, which was partially caused by the differing prevalence of these categories, as indicated in . When learning from imbalanced data sets, where the number of examples in one (majority) class is much higher than in the others, machine-learning algorithms tend to produce better predictive accuracy over the majority classes but poorer predictive accuracy over the minority classes. This is an open research challenge for the machine learning community. Over-sampling algorithms36
can be used to improve the performance for minority classes. Another needed improvement is to develop a multi-label classifier for eligibility criteria. Although the majority of eligibility criteria contain only one topic, about 8% of eligibility queries contain multiple topics. For example, the criterion "pregnant women" contains two topics, pregnancy and gender. Another example is "male and female with age between 18 and 65." Multiple topics may also be present less explicitly; for instance, "positive pregnancy lab tests" could be categorized as both Lab Test Results and a pregnancy-related category. However, our classifier assigns only one category to each of these eligibility criteria. Moreover, the categories resulting from hierarchical clustering are not completely mutually exclusive and can contain hidden relations (e.g., a set of lab tests for measuring pregnancy), which could also have affected the classification accuracy. These research questions warrant further study.
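The simplest over-sampling scheme mentioned above can be sketched as random duplication of minority-class examples until all classes match the majority class size. This is a minimal illustration; the cited over-sampling literature includes more sophisticated variants (e.g., SMOTE, which synthesizes new examples rather than duplicating existing ones).

```python
import random
from collections import Counter

def random_oversample(examples, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    reaches the majority class size (the simplest over-sampling scheme)."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        # Pad each class with randomly chosen duplicates up to the target size.
        out_x += xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_y += [y] * target
    return out_x, out_y
```

Applied before training, this rebalances the label distribution that the classifier sees, at the cost of repeated minority examples and a risk of overfitting to them.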