We analyzed the extent to which comparative effectiveness research (CER) organizations share terms for designs, analyzed coverage of CER designs in Medical Subject Headings (MeSH) and Emtree, and explored whether scientists use CER design terms.
We developed local terminologies (LTs) and a CER design terminology by extracting terms in documents from five organizations. We defined coverage as the distribution over match type in MeSH and Emtree. We created a crosswalk by recording terms to which design terms mapped in both controlled vocabularies. We analyzed the hits for queries restricted to titles and abstracts to explore scientists' language.
Pairwise LT overlap ranged from 22.64% (12/53) to 75.61% (31/41). The CER design terminology (n=78 terms) consisted of terms for primary study designs and a few terms useful for evaluating evidence, such as opinion paper and systematic review. Patterns of coverage were similar in MeSH and Emtree (gamma=0.581, P=0.002).
Stakeholder terminologies vary, and terms are inconsistently covered in MeSH and Emtree. The CER design terminology and crosswalk may be useful for expert searchers. For partially mapped terms, queries could consist of free text for modifiers such as nonrandomized or interrupted added to broad or related controlled terms.
The emergent field of comparative effectiveness research (CER)‡ is beset by differences in language among stakeholders. These include methodologists in organizations that promote CER, scientists who generate original data or synthesize secondary data, panels of experts who rely on extant research to design guidelines for best practice, and policymakers who identify and prioritize future research needs. For health sciences librarians who regularly support this panoply of stakeholders, it is necessary to know about differences in order to interpret service requests. For example, the following terms are used inconsistently: CER, evidence-based medicine (EBM), and health technology assessment (HTA); randomization and random sampling; efficacy and effectiveness.
Recently, the MLA News published two accessible reports to introduce librarians to CER in which the authors compare CER and EBM 2, 3. A more thorough essay comparing CER, EBM, and HTA along several dimensions appears in The Milbank Quarterly 4, with some discussion of semantic differences between North America and Europe. The authors of another paper discussing infrastructure needs and capacity for conducting CER report that while capacity is adequate, the “majority of researchers are trained in either observational study methods or randomized trials, but rarely both” 5. Thus, a lack of awareness of major approaches to research likely exacerbates the confusion in language. Note that in this paper, we use the term language to mean natural as opposed to formal language, with a focus on the use of phrases to communicate concepts for study designs. Jurafsky and Martin's text explains the distinction 6. Several authors provide background papers on the structure of scientific language, sublanguages, and epistemological differences among disciplines 7–9.
An important aspect of CER is the focus on the generalizability of findings to diverse populations of real interest. Broadly, CER is concerned with answering questions regarding effectiveness rather than efficacy of interventions, which has implications for the usefulness of various study designs. Nonrandomized (NR) or observational studies, rather than randomized controlled trials (RCTs), may better answer effectiveness questions, even though well-known threats to validity exist for the former 10. For example, consider that a well-conducted RCT ensures the statistical equivalence of groups via randomization (random assignment of treatments to experimental units or vice versa) prior to treatment and that finding a treatment effect is therefore likely to be reproducible under the same experimental conditions. However, the design of an RCT promotes internal validity at the expense of external validity (generalizability) when the investigators cannot randomly sample “units,” such as patients. In contrast, researchers who conduct an NR study might randomly sample participants from populations of interest. Random sampling, if done well, as opposed to random assignment ensures that study groups will resemble the populations of interest. This is a major reason for recognizing the value of evidence derived from NR studies. In the best of worlds, a CER question would be answered by both RCTs and NR studies. This is why systematic reviewers who synthesize biomedical evidence look for both kinds of studies.
Unfortunately, consensus does not exist regarding how best to describe NR studies common to CER. According to the Cochrane Non-Randomised Studies Methods Group, both investigators and indexers inconsistently describe study designs 11. Challenges arise for expert searchers, indexers, and methodologists due to the hodgepodge of terms that stakeholders use within and across disciplines. This problem is well known, and groups around the world have issued statements regarding standards for reporting studies and their designs. To improve the value of medical research, an international initiative known as the EQUATOR Network 12 maintains a library of reporting guidelines by study type, such as STAndards for the Reporting of Diagnostic Accuracy Studies (STARD) for diagnostic accuracy studies 13, Consolidated Criteria for Reporting Qualitative Research (COREQ) 14, and STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) 15. In general, guidelines suggest that authors name their study design in the title or abstract and use a common term, but names are not standardized. Thus, inconsistent indexing, varying stakeholder language, and multiple reporting standards lead to serious retrieval challenges for health sciences librarians.
In this study, we investigated whether methodologists in several highly regarded CER organizations share a terminology for study designs and to what extent. By terminology, we mean a set consisting mostly of phrases, which is consistent with International Organization for Standardization (ISO) 1087 “Terminology–Vocabulary Standard,” described by Hammond and Cimino 16. To compare organizational terminologies, we culled design terms and terms for related concepts from relevant documents. We then built a CER design terminology based on the documents we identified to evaluate whether and how terms for study designs used by experts correspond to terms in Medical Subject Headings (MeSH) 17 and Emtree 18, the controlled vocabularies for MEDLINE and Embase, respectively. To support librarians, we developed a crosswalk between MeSH and Emtree with suggestions for queries when design terms partially map to broad controlled terms or fail to map. We also explored whether scientists use CER design terms to describe their own studies.
To ensure relevancy, we elected to work with classification algorithms from respected CER organizations. Further, to ensure validity, we selected algorithms already vetted by methodologists. We therefore chose algorithms developed by organizations identified in two recent methods studies funded by the Agency for Healthcare Research and Quality (AHRQ) 19, 20. The organizations and data sources included:
We extracted terms from the selected resources and augmented subsequent lists with designs mentioned in corresponding glossaries, tables, and appendixes. We refer to the resultant lists of terms as local terminologies (LTs) throughout this paper.
In developing the Alberta algorithm, a steering committee with members from AHRQ and AHRQ-funded evidence-based practice centers (EPCs) identified thirty-one organizations and experts. They asked respondents to return classification tools or systems to “ensure that we have a broad spectrum” (Letter of Request to Identify Study Design Classification Tools, Appendix C; see also Appendix B for a list of contacts 19). The identified organizations included all fifteen of the AHRQ-funded EPCs and seven other organizations, such as the Cochrane Collaboration, the Campbell Collaboration, and the National Health Service (NHS) of the United Kingdom. Additionally, nine anonymous experts were contacted. Eleven respondents returned twenty-three tools, algorithms, guidelines, or instruments for classification. Ten were selected for further analysis. Members of the steering committee independently rated the selected tools and identified the Cochrane algorithm as most suitable for further development. The ADA and RTI algorithms were rated second and third, respectively.
The Alberta algorithm is the basis for a framework currently being developed by AHRQ to promote a standard taxonomy for considering the suitability of various designs in carrying out future research. Data sources in the AHRQ report include designs that vary somewhat orthographically from the Alberta report, along with additional terms (e.g., systematic review, modeling, and meta-analysis of individual participant data).
We manually extracted terms from the data sources just described. For example, if the source was a classification algorithm structured as a decision tree, we extracted the design term and any examples or synonyms displayed at the end of each path. If the source was a glossary, we extracted each term along with any examples mentioned in its description. If the source was a table, we extracted terms in the cells or, if applicable, from the footer defining design acronyms used as column names.
We pooled terms from all five LTs, deleted duplicates, and converted to lower case. We treated as equivalent orthographic variations—such as randomized (US spelling) and randomised (British spelling); meta-analysis, metaanalysis, and meta analysis; and before-after and before-and-after. Similarly, we considered as equivalent singular and plural words, such as study and studies, and acronyms for corresponding terms, such as RCT for randomized controlled trial and IPD for individual patient data.
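The normalization rules just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure the authors ran (the extraction is described as manual): the function name `normalize_term` and the lookup tables `ACRONYMS` and `SINGULAR` are hypothetical, and they cover only the example variants named in the text.

```python
import re

# Hypothetical lookup tables, limited to the examples given in the text.
ACRONYMS = {"rct": "randomized controlled trial", "ipd": "individual patient data"}
SINGULAR = {"studies": "study", "trials": "trial", "designs": "design"}

def normalize_term(term):
    """Map a design term to one canonical lower-case form."""
    t = " ".join(term.lower().split())           # lower case, collapse whitespace
    t = t.replace("randomis", "randomiz")        # British -> US spelling
    # meta-analysis, metaanalysis, and meta analysis are treated as equivalent
    t = re.sub(r"\bmeta[ -]?analys(is|es)\b", "meta-analysis", t)
    t = t.replace("before-and-after", "before-after")
    words = []
    for w in t.split(" "):
        w = ACRONYMS.get(w, w)                   # expand acronyms
        words.append(SINGULAR.get(w, w))         # plural -> singular
    return " ".join(words)

def pool_terms(raw_terms):
    """Pool raw terms and drop duplicates after normalization."""
    return {normalize_term(t) for t in raw_terms}
```

For instance, `pool_terms(["RCT", "Randomised Controlled Trials"])` collapses both inputs to the single entry `randomized controlled trial`.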
After term extraction, augmentation, and processing in the manner described, the union of terms occurring in one or more LTs defined the new CER design terminology. Additionally, the intersection of terms occurring in all five LTs defined the core set of design terms.
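In set terms, the full terminology is a union and the core set is an intersection. The following sketch shows that construction, along with a count of terms that occur in exactly one LT. The LT names and their contents in the usage example are placeholders, not the actual term lists, and `build_terminology` is a hypothetical helper name.

```python
from collections import Counter

def build_terminology(local_terminologies):
    """Return (union, core, singletons) for a dict of name -> iterable of terms.

    union      : terms occurring in one or more LTs
    core       : terms occurring in every LT
    singletons : terms occurring in exactly one LT
    """
    sets = [set(terms) for terms in local_terminologies.values()]
    union = set().union(*sets)
    core = set.intersection(*sets)
    counts = Counter(term for s in sets for term in s)
    singletons = {term for term, n in counts.items() if n == 1}
    return union, core, singletons

# Placeholder data for illustration only.
example_lts = {
    "lt_a": ["case series", "case-control study", "trend study"],
    "lt_b": ["case series", "case-control study"],
    "lt_c": ["case series", "opinion paper"],
}
union, core, singletons = build_terminology(example_lts)
```

With the placeholder data, `union` holds four terms, `core` holds only `case series`, and `singletons` holds `trend study` and `opinion paper`.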
To evaluate coverage, we searched for CER design terms in MeSH and Emtree. We recorded the type of match per term as exact, partial, or no match; coverage was defined by the distribution over match type. In MeSH, if a CER term or any of its variants directly mapped to a main heading or entry term, the match was exact; if part of the term mapped to a broader or related term or to a substring in a scope note, the match was partial; otherwise, “no match” was recorded. Mapping procedures in Emtree were modified somewhat but were quite similar to those in MeSH.
We used the US National Library of Medicine (NLM) MeSH browser 22 to search for terms in a stepwise manner. In each step, we used a different combination of browser settings but otherwise followed the same search strategy:
We also searched Emtree in Embase 18, a subscription database. We navigated to the “Find Term” tab and modified terms in a stepwise manner as in MeSH, first searching for exact matches. On occasion, searching for study or trial was helpful, as it led to a variant form that we considered equivalent (e.g., validity study does not directly map to validation study, but appears under study).
We created a crosswalk between MeSH and Emtree for CER design terms by recording the controlled terms to which they exactly or partially mapped. For partial and no matches, we recorded whether terms were negated or detailed, if appropriate. A negated term includes at least one word or phrase that is counter to or is in opposition to an affirmed word or phrase in another design term. For example, interrupted time series without comparison group is a negated term because it is counter to an interrupted time series with comparison group. Specifically, without comparison group negates with comparison group. Both design terms are detailed because they are multiword phrases with several modifiers, including interrupted, time, and comparison.
In the crosswalk, we offered suggestions regarding potential alternatives or query expansions for some design terms.
To explore whether scientists use the terms for designs and related concepts as expressed by experts in CER organizations, we used quoted strings of terms and variants and restricted our searches to titles and abstracts. Because Embase regularly adds MEDLINE records 23, we could search records from both databases via Embase, which ensured comparability of searches.
To count the number of hits per CER term by database, we compared hits from two searches. In the first, we searched de-duplicated records originating in either database using <design term>:ab,ti. In the second, we restricted the search to records from Embase using <design term>:ab,ti NOT [medline]/lim AND [embase]/lim. To find the number of hits in MEDLINE, we subtracted the count for the second search from the first. Here is a sample query:
‘before-after study’:ab,ti OR ‘before-after studies’:ab,ti OR ‘before-after design’:ab,ti OR ‘before-after designs’:ab,ti OR ‘before-after trial’:ab,ti OR ‘before-after trials’:ab,ti OR ‘before-and-after study’:ab,ti OR ‘before-and-after studies’:ab,ti OR ‘before-and-after design’:ab,ti OR ‘before-and-after designs’:ab,ti OR ‘before-and-after trial’:ab,ti OR ‘before-and-after trials’:ab,ti NOT [medline]/lim AND [embase]/lim
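A query of this shape can be assembled programmatically from a list of stems and head nouns, and the MEDLINE count then follows by subtraction. The sketch below reproduces the string format of the sample query, with straight quotes substituted for the typeset quotes. The function names are hypothetical, and the string is built without parentheses exactly as in the sample, so operator precedence should be checked in Embase before relying on it.

```python
from itertools import product

def title_abstract_query(stems, nouns, embase_only=False):
    """Build an Embase title/abstract query from every stem + noun variant."""
    phrases = [f"'{stem} {noun}':ab,ti" for stem, noun in product(stems, nouns)]
    query = " OR ".join(phrases)
    if embase_only:
        # Restrict to records that originate in Embase, as in the second search.
        query += " NOT [medline]/lim AND [embase]/lim"
    return query

def medline_hits(all_hits, embase_only_hits):
    """MEDLINE hits = hits across both sources minus Embase-only hits."""
    return all_hits - embase_only_hits

stems = ["before-after", "before-and-after"]
nouns = ["study", "studies", "design", "designs", "trial", "trials"]
query = title_abstract_query(stems, nouns, embase_only=True)
```

The hit counts passed to `medline_hits` would come from running the unrestricted and the restricted query in the database; they are not computed here.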
We used Excel 2003 and 2010 24, 25, as well as IBM SPSS version 20 26, for statistical analyses of term distributions, computation of pairwise LT overlap and overlap with the CER design terminology, evaluation of coverage, and comparison of hits for queries. By overlap, we mean the percentage of shared terms between LTs or between an LT and the CER design terminology.
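As a minimal sketch of the overlap computation, assume that overlap is the number of shared terms divided by the number of distinct terms in the two lists combined. That reading is an assumption, though it is consistent with the reported 31/41 for two LTs of 33 and 39 terms, since 33 + 39 − 31 = 41. The function names are hypothetical, and the integer sets in the usage example are stand-ins that reproduce only the set sizes, not real terms.

```python
from itertools import combinations

def overlap_pct(a, b):
    """Percentage of shared terms between two term sets (shared / union)."""
    a, b = set(a), set(b)
    combined = a | b
    if not combined:
        raise ValueError("both term sets are empty")
    return 100.0 * len(a & b) / len(combined)

def pairwise_overlap(local_terminologies):
    """Overlap for every unordered pair of LTs in a dict of name -> terms."""
    return {
        (x, y): overlap_pct(local_terminologies[x], local_terminologies[y])
        for x, y in combinations(sorted(local_terminologies), 2)
    }

# Stand-in sets: 33 and 39 members with 31 in common.
set_33 = set(range(33))
set_39 = set(range(2, 41))
example = round(overlap_pct(set_33, set_39), 2)
```

The overlap between an LT and the full design terminology is the special case in which one set contains the other, so the denominator is simply the size of the terminology.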
The augmented LTs varied in length: Alberta (n=33 terms), AHRQ (n=39), Cochrane (n=32), ADA (n=36), and RTI (n=25). The CER design terminology (n=78), derived from terms that occurred in 1 or more LTs, consisted mostly of terms for primary study designs and a few terms useful for evaluating evidence, such as opinion paper and systematic review (Table 1). About half the terms (47.44%, 37/78) appeared in just 1 LT. A few terms (8.97%, 7/78) were common to all LTs (Figure 1). These included before-after study, case-control study, case series, cross-sectional study, prospective cohort study, retrospective cohort study, and randomized controlled trial.
Alberta had the most in common with the other terminologies (mean pairwise overlap=48.77%, 24 shared terms on average); RTI had the least in common (25.65%, 12 shared terms on average) (Table 2). The overlap between pairs of LTs ranged from 22.64% (12 shared terms) for AHRQ and RTI to 75.61% (31 shared terms) for Alberta and AHRQ (Table 2). The overlap of LTs with the new CER design terminology ranged from 32.05% (25/78) for RTI to 50.00% (39/78) for AHRQ (Table 3).
Patterns of coverage in MeSH and Emtree are displayed in Table 4 and Figure 2. Coverage as defined by the distribution over match type was similar; the association was positive and statistically significant (Goodman-Kruskal gamma=0.581, P=0.002). Gamma is a nonparametric measure suitable for testing the bivariate association between ordinal variables. It can be interpreted as a correlation coefficient, as it falls between −1 and +1.
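For readers unfamiliar with the statistic, gamma is computed from the counts of concordant pairs (C) and discordant pairs (D) in a cross-tabulation as (C − D)/(C + D), with tied pairs ignored. The sketch below is a plain-Python illustration, not the SPSS procedure used in the study. It assumes rows and columns are ordered the same way (for example, no match, partial, exact), and the small table in the usage example is invented for illustration.

```python
def goodman_kruskal_gamma(table):
    """Gamma for a contingency table given as a list of rows of counts.

    Rows and columns must both be in ascending ordinal order.
    """
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare cell (i, j) with every cell in a later row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:      # higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:    # higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (concordant - discordant) / (concordant + discordant)

# Invented 2 x 2 example: C = 2*2 = 4, D = 1*1 = 1, so gamma = 3/5.
toy_gamma = goodman_kruskal_gamma([[2, 1], [1, 2]])
```

This computes the point estimate only; the significance test reported above is not reproduced here.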
Match type per term is displayed in Table 5 (online only). The terms to which CER design terms most often mapped were similar in both vocabularies. In MeSH, they were randomized controlled trial, controlled clinical trial, longitudinal studies, cohort studies, and clinical trial. In Emtree, they were randomized controlled trial, controlled study, time series analysis, and cohort analysis.
Frequent partial mapping indicated a broad or related MeSH or Emtree term. For example, the following CER terms mapped to the MeSH term randomized controlled trial: cluster randomized controlled trial, cluster randomized trial, group randomized trial, open-label randomized controlled trial, randomized trial, and single-blinded randomized controlled trial. In Emtree, the terms were the same with the exception of open-label randomized controlled trial, which mapped to open study.
We labeled CER design terms as detailed relative to MeSH and Emtree if they consisted of more than 3 words, ignoring prepositions and hyphens (23.08%, 18/78). Almost all of the MeSH terms and entry terms, and most of the Emtree terms and synonyms to which CER terms mapped were at most 3 words long. Examples of detailed CER terms included cluster quasi-randomized controlled trial and meta-analysis of individual participant data.
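The word-count rule can be sketched as follows. Two assumptions are made here and should be read as such: hyphens are treated as word separators (which is consistent with nested case-control study being counted as detailed elsewhere in the results), and the preposition list is a short hypothetical one rather than the list the authors applied.

```python
import re

# Hypothetical, deliberately short preposition list.
PREPOSITIONS = {"of", "with", "without", "in", "for", "by", "to", "on", "at", "from"}

def content_words(term):
    """Split on whitespace and hyphens, then drop prepositions."""
    tokens = re.split(r"[\s-]+", term.lower().strip())
    return [t for t in tokens if t and t not in PREPOSITIONS]

def is_detailed(term, max_words=3):
    """A term is 'detailed' if it has more than max_words content words."""
    return len(content_words(term)) > max_words
```

Under these assumptions, `is_detailed("cluster quasi-randomized controlled trial")` and `is_detailed("meta-analysis of individual participant data")` are both true, while `is_detailed("randomized controlled trial")` is false.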
Several terms (14.10%, 11/78) involved negation, such as cluster nonrandomized controlled trial, interrupted time series without comparison group, nonrandomized crossover trial, and uncontrolled longitudinal study.
For exact matches in 1 or both controlled vocabularies (n=29), 1 term was detailed (3.45%, 1/29) and 1 negated (3.45%, 1/29): nested case-control study and non-experimental study, respectively. Emtree covered more terms exactly than MeSH (26 Emtree vs. 15 MeSH).
For partial matches in 1 or both controlled vocabularies (n=55), 18 terms were detailed (32.73%, 18/55) and 10 negated (18.18%, 10/55). MeSH covered more terms partially than Emtree (49 MeSH vs. 45 Emtree). Sixteen terms partially mapped to a MeSH term because of a matching substring in the scope note. For example, trend study mapped to the MeSH term sentinel surveillance because the scope note included the following excerpt: “the study of disease rates in a specific cohort, geographic area, population subgroup, etc. to estimate trends [emphasis added].”
For terms not matched in 1 or both controlled vocabularies (n=15), 13.33% (2/15) were negated: noncomparative study and non-experimental study. (Note that while non-experimental study failed to map in MeSH, it exactly mapped to a synonym for observational study in Emtree.) No unmatched term was detailed. MeSH had twice as many “no matches” as Emtree (14 MeSH vs. 7 Emtree). Both controlled vocabularies failed to cover before-after study (including variants), which is a core term appearing in all 5 LTs. Checking whether unmapped terms appeared in any of the Unified Medical Language System (UMLS) resources 27, we found that 20% (3/15) mapped to terms in the National Cancer Institute (NCI) Thesaurus 28, including community trial (C1516736), factorial study (C2826344), and parallel study (C2826345).
The average number of hits for CER design queries restricted to titles and abstracts varied with the record source and type of match.
The median (MDN) number of MEDLINE records retrieved in Embase was 1,090 (range: 0 to 222,804); the MDN number of Embase records was 380 (range: 0 to 89,807). Case report yielded the most hits for both MEDLINE and Embase.
Median hits by type of match were as follows. For MeSH: exact (MDN=37,750; range: 960 to 222,804), partial (MDN=735; range: 0 to 114,303), and no match (MDN=590; range: 11 to 11,915). For Emtree: exact (MDN=9,617; range: 54 to 89,807), partial (MDN=199; range: 0 to 22,584), and no match (MDN=152; range: 9 to 432).
Based on nonparametric independent-samples median tests (MTs), the average hits varied significantly across type of match for MeSH (MT=14.022, df=2, P<0.001) and Emtree (MT=19.789, df=2, P<0.001). Pairwise differences in hits were significant for the exact versus partial category comparison (MeSH MT=14.716, P<0.001; Emtree MT=16.258, P<0.001) and the exact versus no match category (MeSH MT=12.523, P<0.001; Emtree MT=8.362, P<0.011). Differences were statistically nonsignificant for the partial versus no match comparison in both MeSH and Emtree. P values were adjusted for the number of comparisons.
With the exception of the AHRQ and Alberta LTs, organizational terminologies varied quite a bit as measured by pairwise overlap and overlap with the new CER design terminology. The reason for this exception is that AHRQ is developing a taxonomy for study designs that builds on the Alberta classification tool. However, the overlap was not perfect because we augmented the basic set of terms that the 2 organizations share with terms from supplementary documents. Note that augmenting term lists was useful because we were not evaluating extant terminologies per se, but were interested in using documents vetted by methodologists to analyze differences in language. Thus, the mean pairwise overlap of 36% for the augmented LTs and the mean overlap of 42% with the CER design terminology substantiated what we had expected: that language varies by organization even when the domain is ostensibly the same.
To explore coverage of designs and related concepts in MeSH and Emtree, we developed a terminology that consists of terms used by organizations dedicated to promoting CER, especially systematic reviews of medical evidence. Just seven terms were common to all five organizations, and even this core set of shared terms was inconsistently covered in the controlled vocabularies. For example, the core terms case-control study, cross-sectional study, and randomized controlled trial exactly mapped to controlled terms in both MeSH and Emtree, whereas case series exactly mapped in Emtree and partially in MeSH; prospective cohort study and retrospective cohort study partially mapped in both; and before-after study failed to map in either. This inconsistent coverage of core terms suggests that CER organizational language does not correspond well with indexing for designs.
Regarding the full set of terms for designs and related concepts in our terminology, most either partially mapped or failed to map to broad or related controlled terms. In some cases, the controlled terms were not for study designs per se, but research domains. For example, analytic study mapped to analytical research and descriptive study to descriptive research in Emtree. Interestingly, while the core term randomized controlled trial exactly mapped in MeSH and Emtree, the counter term nonrandomized controlled trial appearing in the full set did not, even though the latter is a common design. In general, negated terms rarely mapped exactly, with the exception of non-experimental study in Emtree.
Because CER is an emerging discipline, resources are being developed at the regional and federal level to help expert searchers. For example, the University of Pittsburgh Health Sciences Library System, a Regional Medical Library for the Middle Atlantic Region of the National Network of Libraries of Medicine, developed MedTerm Search Assist 29. This tool promotes sharing of biomedical terms and comprehensive search strategies among librarians. One can browse for comparative effectiveness research to find keywords, MeSH terms, and a search filter. Currently, the fields include a few design terms, such as cluster randomized trial and pragmatic clinical trial.
At the federal level, NLM resources are available online by navigating to Comparative Effectiveness Research from Topic-Specific Queries on the PubMed home page 30. For example, the complex query for Observational Studies consists of several blocks: <study designs> AND <comparative terms> AND <common CER topics>. Ignoring spelling variants, most of the terms in the study design block exactly or partially match MeSH terms that we found, with the exception of practice guidelines as topic, matched-pair analysis, and multicenter study. However, quite a few of the relevant designs identified in this study do not appear in the PubMed query.
It is worth noting that the PubMed query for Observational Studies includes terms for retrieving registry studies 31, terms which do not appear in the documents we mined for this analytical study. However, neither the PubMed query nor our CER design terminology has a term for hospital-based case control studies, a design covered in Emtree. Both registry and hospital-based case control studies are increasingly important in CER, partly because electronic medical records facilitate data reuse within health care systems and research across institutions.
The CER design terminology and its crosswalk (Table 5, online only) may be useful for expert searchers who need to search MEDLINE and/or Embase. They could consult the crosswalk when developing queries for users who want studies in the CER domain, especially studies with designs that methodologists classify with negated or detailed phrases, or with terms important in CER such as head-to-head study and pragmatic trial. The latter pair failed to map in MeSH and Emtree.
Throughout Table 5, librarians will find suggestions for alternative terms or query expansions. Because this is a first effort, librarians should be alert to the potential for false positives. For example, focus group exactly mapped to information processing in Emtree because it occurs in a long list of synonyms, and reliability study partially mapped to validation studies in MeSH because reliability is mentioned in the scope note.
In general, MeSH terms for <design> as topic should be avoided, as this heading is usually not assigned to primary studies. At times, however, such a term may be necessary; for example, pre-post study mapped to evaluation studies as topic.
Methodologists used a variety of terms to classify studies involving time, including several versions of before-after study modified by controlled or cohort, time series modified by interrupted and with comparison group or without comparison group, historically controlled trial, nonconcurrent cohort study, pre-post study, several terms modified by prospective or retrospective, and uncontrolled longitudinal study. None of these is well indexed.
In sum, queries for designs with partially mapped terms could consist of free text for modifiers such as nonrandomized or prospective added to broader or related controlled terms, if they exist. Queries for designs with unmapped terms require free text by necessity.
When we considered whether scientists use CER design terms, some striking discrepancies emerged. For example, scientists commonly used terms not well indexed in MeSH or Emtree, such as before-and-after study (1,854 total hits in Embase), descriptive study (16,408 hits), diagnostic study (4,849 hits), prospective cohort study (19,096 hits), and retrospective cohort study (14,798 hits).
On the other hand, scientists rarely used detailed terms, such as cluster nonrandomized controlled trial, cohort before-and-after study, and interrupted time series with comparison group. They were much more likely to describe in various parts of the titles and abstracts how their studies were carried out, effectively splitting up the concepts in detailed terms. For example, searching for “cohort” [tiab] AND “before-and-after” [tiab] in MEDLINE yielded 2,804 hits, whereas searching for the CER design string “cohort before-and-after study” [tiab] yielded 0 hits (24 May 2012). As an aside, searching for just “cohort” [tiab] returned 190,261 hits, which was counter to Eldredge's finding that “authors rarely use the label ‘cohort’ when describing their methods” (p. 85) 32. His comment together with the results of this simple MEDLINE query for cohort point to presumed differences in the sublanguages of librarianship and biomedicine, although this may be changing.
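The contrast between the two MEDLINE searches can be made concrete with a small helper that builds both forms of the query: one exact phrase, and one conjunction of the separate concepts. This is a sketch only. The function names are hypothetical, the searcher decides how a detailed term is split into concepts, and the strings use straight quotes with the [tiab] tag written directly after the closing quote.

```python
def phrase_query(term, tag="tiab"):
    """Search for the whole design term as one quoted phrase."""
    return f'"{term}"[{tag}]'

def split_concept_query(concepts, tag="tiab"):
    """Search for each concept separately and require all of them."""
    return " AND ".join(f'"{c}"[{tag}]' for c in concepts)

as_phrase = phrase_query("cohort before-and-after study")
as_concepts = split_concept_query(["cohort", "before-and-after"])
```

Here `as_phrase` corresponds to the search that returned no hits and `as_concepts` to the search that returned several thousand; the split form trades precision for recall, so results still need screening.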
To improve upon the representativeness of the CER design terminology, additional documents could be mined, such as the AHRQ and Cochrane glossaries 33, 34, which are broader than the documents we used in this study. To be globally representative, documents from major international centers, such as the NHS National Institute for Health and Clinical Excellence in the United Kingdom, could be of use.
In our exploration of scientists' language, we were unable to infer detailed study designs by simple string matching of CER phrases in titles and abstracts. Thus, to improve our approach, semantic analysis 6 of texts written by scientists could be worthwhile.
The crosswalk in Table 5 (online only) could be further developed by librarians, paying attention to the potential for false positives given their users' needs and changes in indexing. An obvious extension would be to add other controlled vocabularies for databases that librarians regularly search, such as PsycINFO. Additionally, more terms such as comment or letter for “NOTing out irrelevant content” should be added, as these can improve precision for exhaustive searches 35. Although it was not our intention to develop a search filter, our design terminology and its crosswalk could be of use to librarians and trials search coordinators who support systematic reviewers and other comparative effectiveness researchers.
In this study, we have demonstrated that the degree to which methodologists in CER organizations share a terminology for designs varies considerably. Further, we have shown that coverage of design terms and related concepts is similar in MeSH and Emtree and that the majority of terms partially map or fail to map to controlled terms. This poses challenges for librarians who support users in various CER communities. Finally, exhaustive searches require free text for concepts appearing in detailed design phrases because scientists split up terms in their titles and abstracts.
*This study was partially supported by grants awarded to Tanja Bekhuis from the National Institutes of Health, National Library of Medicine, grant no. 1K99LM010943-01A1 and grant no. 4R00LM010943-02.
†A presentation describing this study was given to the American Medical Informatics Association (AMIA) 2012 Annual Symposium, November 3–7, Chicago, IL.
Supplemental Table 5 is available with the online version of this journal.
‡The Agency for Healthcare Research and Quality (AHRQ) defines comparative effectiveness research as: “Comparative effectiveness research is designed to inform health-care decisions by providing evidence on the effectiveness, benefits, and harms of different treatment options. The evidence is generated from research studies that compare drugs, medical devices, tests, surgeries, or ways to deliver health care” 1.