In consultation with a clinician, we selected the following four disease topics for data acquisition:
· Arterial hypertension
· Diabetes mellitus type 2
· Congestive heart failure
· Pneumococcal pneumonia
Each disease is a significant global health concern, and of interest to clinicians in many areas of the world. Collectively, they have an interesting variety of preventive interventions and treatment options.
We executed a single PubMed search query for each pairing of a disease topic with a point of view (i.e., drug treatment or prevention), using specific MeSH term and subheading combinations. The following lists indicate the exact MeSH terms and subheadings we used in forming these pairings:
· Diabetes Mellitus, Type 2
· Heart Failure
· Pneumonia, Pneumococcal
· drug therapy
· prevention and control
For example, to acquire citations addressing drug treatment options for pneumococcal pneumonia, we executed the search phrase “Pneumonia, Pneumococcal/drug therapy[Mesh]”. To provide an evidence-based focus, we first restricted output to the publication types “clinical trials,” “randomized controlled trials,” “practice guidelines,” and “meta-analyses.” We then acquired citations for systematic reviews, using the publication type “review” and the keyword phrase “systematic review.” Realistically, a clinician could engage Semantic MEDLINE using anything from a general keyword search to a very sophisticated search utilizing many of PubMed’s search options. In addition to providing the initial topic/point-of-view pairing, this method of forming search queries also provided a middle ground within the spectrum of queries a clinician might actually use. We also restricted publication dates to coincide with the most recently published source materials DynaMed used in building its recommendations, which served as the basis for our evaluative reference standards (described in detail below). This restriction ensured that we did not retrieve materials that DynaMed curators could not have reviewed in creating their own recommendations. These cutoff dates are indicated in the tabular data of the Results section. The eight search queries produced eight separate citation datasets, each representing a pairing of one of the four disease topics with one of the two subheading concepts. We executed the eight search queries and downloaded all citations during July and August 2011.
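As an illustration only, the topic/point-of-view pairing can be sketched as a Cartesian product of MeSH terms and subheadings. The names `MESH_TERMS`, `SUBHEADINGS`, and `build_queries` are hypothetical. The term "Hypertension" is an assumption, since the list above shows only three disease terms. The publication type and date restrictions are not modeled here.

```python
from itertools import product

# Disease MeSH terms. "Hypertension" is an assumed term for arterial
# hypertension; the other three are taken from the list above.
MESH_TERMS = [
    "Hypertension",
    "Diabetes Mellitus, Type 2",
    "Heart Failure",
    "Pneumonia, Pneumococcal",
]

# The two point-of-view subheadings.
SUBHEADINGS = ["drug therapy", "prevention and control"]


def build_queries(terms, subheadings):
    """Return one 'Term/subheading[Mesh]' search phrase per pairing."""
    return [f"{term}/{sub}[Mesh]" for term, sub in product(terms, subheadings)]


if __name__ == "__main__":
    for query in build_queries(MESH_TERMS, SUBHEADINGS):
        print(query)
```

Under these assumptions the sketch yields eight search phrases, one for each citation dataset.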
We processed each of the eight citation datasets separately with SemRep, then with Semantic MEDLINE utilizing the Combo algorithm. We also processed the four SemRep output datasets originating from the search queries that included the drug therapy subheading with conventional Semantic MEDLINE utilizing the built-in treatment point-of-view schema (i.e., with predetermined, hard-coded patterns). We used the following UMLS Metathesaurus preferred concepts as seed topics (required by Semantic MEDLINE) to summarize SemRep data originating from both disease/drug treatment and disease/prevention and control search query pairings:
· Hypertensive disease
· Diabetes Mellitus, Non-Insulin-Dependent
· Congestive heart failure (OR Heart failure)
· Pneumonia, Pneumococcal
We built a reference standard for each disease topic/point-of-view pairing, using vetted interventions from DynaMed, a commercial decision support product. We captured the DynaMed text for recommendations on both preventive and drug treatment interventions for each disease topic. We forwarded this text to two physician-reviewers, who highlighted the interventions they thought were viable for the associated diseases. In annotating these materials, we instructed the reviewers to ask themselves “What are the drugs used to treat this disease?” and “What interventions prevent this disease?”. Disagreements between the two annotators were forwarded to a third physician adjudicator, who made the final decision regarding the conflicting annotations. The two primary reviewers were a cardiologist and a preventive medicine specialist. The adjudicator was a pathologist. We measured agreement between the two reviewers using fundamental inter-annotator agreement (IAA) where instances of agreement are divided by the sum of agreement instances and disagreement instances, or in other words, matches/(matches + non-matches). As an example, we list below the final reference standard of DynaMed arterial hypertension preventive interventions:
· Maintain normal body weight
· Reduce sodium intake
· Increased daily life activity
· Higher folate intake
· Regular aerobic physical activity
· Diet reduced in saturated and total fat
· Walking to work
· Increased plant food intake
· Diet rich in fruits, vegetables and low-fat dairy products
· Whole-grain intake
· Regular tea consumption
· Limit alcohol use
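The inter-annotator agreement measure used in building the reference standards, matches/(matches + non-matches), can be sketched as follows. The function name and the counts in the usage example are hypothetical and are not results from this study.

```python
def inter_annotator_agreement(matches, non_matches):
    """Fraction of annotation instances on which two reviewers agreed."""
    total = matches + non_matches
    if total == 0:
        raise ValueError("at least one annotation instance is required")
    return matches / total


if __name__ == "__main__":
    # Hypothetical counts: 18 agreements and 2 disagreements.
    print(inter_annotator_agreement(18, 2))
```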
The final, combined reference standards included a total of 225 interventions, with an average of approximately 28 interventions for each disease topic/point-of-view pairing. Table lists the counts for all eight reference standards.
Reference standard intervention counts
We built eight baselines that simulated what a busy clinician might find when directly reviewing the PubMed citations. This approach is based on techniques developed by Fiszman [26] and Zhang [31]. To build baselines for the four disease topic/drug treatment pairings, we processed their PubMed citations with MetaMap, restricting output to UMLS Metathesaurus preferred concepts associated with the UMLS semantic group Chemicals and Drugs, and removed vague concepts using Novelty processing. Threshold values were determined by calculating the mean of term frequencies in a baseline group, and then adding one standard deviation to the mean. In each group, all terms whose frequency scores exceeded the threshold value were retained to form the group’s baseline. For example, for the congestive heart failure drug treatment group, the method extracted 1784 terms that occurred 63924 times in the MetaMap data, with a mean of approximately 35.8 occurrences per term, and a standard deviation of 154.4. This produced a cutoff threshold of 190.3. Therefore, all MetaMap terms that occurred 190 times or more were included in the congestive heart failure drug treatment baseline (a total of 72 terms). This method is meant to simulate the types of terms a busy clinician might notice when quickly scanning PubMed citations originating from a search seeking drug treatment for a given disease.
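The thresholding step described above (mean term frequency plus one standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name and the toy counts are hypothetical. The use of the population standard deviation is an assumption, since the text does not say which variant was used. The strict comparison follows the wording "exceeded the threshold value".

```python
from statistics import mean, pstdev


def baseline_terms(term_counts):
    """Return (threshold, retained terms) for a dict of term -> frequency.

    threshold = mean frequency + one standard deviation.
    Population standard deviation (pstdev) is an assumption.
    """
    frequencies = list(term_counts.values())
    threshold = mean(frequencies) + pstdev(frequencies)
    retained = {term for term, count in term_counts.items() if count > threshold}
    return threshold, retained


if __name__ == "__main__":
    # Toy counts for illustration only; not data from the study.
    counts = {"Furosemide": 9, "Water": 1, "Glucose": 1, "Sodium": 1}
    threshold, retained = baseline_terms(counts)
    print(round(threshold, 2), sorted(retained))
```

In the congestive heart failure example, the mean term of this sum is 63924/1784, or approximately 35.8 occurrences per term.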
We formed baselines for citations emerging from each disease topic/prevention and control pairing in a similar manner. We extracted the lines from the associated PubMed citations that contained the phrases “prevent,” “prevents,” “for prevention of,” and “for the prevention of.” These lines were processed with MetaMap, and all UMLS Metathesaurus preferred concepts associated with the UMLS disorders semantic group were removed, since the focus was preventive interventions and not the diseases themselves. Threshold values were calculated for the remaining terms, and those whose frequencies exceeded their threshold scores were retained as baseline terms. To reiterate, preventive baselines (as well as the drug treatment baselines) are meant to simulate what a busy clinician might notice when seeking interventions while visually scanning PubMed citations originating from a search seeking such interventions for a given disease.
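The line-extraction step for the prevention baselines can be sketched with a regular expression. This is an illustrative sketch under stated assumptions: the pattern matches the four phrases as whole words and ignores case, neither of which the text specifies, and the names `PREVENTION_PATTERN` and `prevention_lines` are hypothetical. The sample text is invented for the example.

```python
import re

# Matches "prevent", "prevents", "for prevention of", "for the prevention of".
# Whole-word, case-insensitive matching is an assumption.
PREVENTION_PATTERN = re.compile(
    r"\b(?:prevents?|for (?:the )?prevention of)\b", re.IGNORECASE
)


def prevention_lines(citation_text):
    """Return the lines of a citation that contain a prevention phrase."""
    return [
        line
        for line in citation_text.splitlines()
        if PREVENTION_PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = (
        "Aerobic exercise prevents hypertension in adults.\n"
        "Prevention strategies remain underused.\n"
        "Vaccination for the prevention of pneumonia was assessed.\n"
        "Blood pressure was measured weekly."
    )
    for line in prevention_lines(sample):
        print(line)
```

With whole-word matching, a line that contains only the word "Prevention" is not selected. Relaxing the word boundaries would change that behavior.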
Comparing outputs to the reference standards
We evaluated outputs for the two summarization methods (Combo algorithm and conventional schema summarization) and the baselines by manually comparing them to the reference standards for the eight disease topic/subheading pairings. Since the reference standard was always a list of interventions, the comparison was straightforward. We measured recall, precision, and F1-score (balanced equally between recall and precision).
For both summarization systems, we measured precision by grouping subject arguments by name and determining what percentage of these subject groups expressed a true positive finding. For outputs for the four disease topic/drug intervention pairings, we limited analysis to semantic predications in the general form of “Intervention X_TREATS_disease Y”, where the object argument reflected the associated disease concept. If the subject intervention X argument matched a reference standard intervention, that intervention received a true positive status. In similar predications where the subject argument was a general term, such as “intervention regimes,” we examined the original section of citation text associated with the semantic predication. If this citation text indicated a reference standard intervention it received a true positive status. For example, in the dynamic summarization output for arterial hypertension prevention, the semantic predication “Dietary Modification_PREVENTS_Hypertensive disease” summarized citation text that included advice for dietary sodium reduction [40]; therefore, the reference standard intervention “reduce sodium intake” received a true positive status.
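The three metrics can be sketched as follows. This is a sketch under assumptions, not the evaluation code used in the study. Precision is computed over output subject groups and recall over reference standard interventions, so the two true positive counts are passed separately. The function name, parameter names, and example counts are hypothetical.

```python
def precision_recall_f1(tp_groups, n_groups, tp_reference, n_reference):
    """Compute precision, recall, and balanced F1-score.

    tp_groups:    output subject groups expressing a true positive finding
    n_groups:     all output subject groups
    tp_reference: reference standard interventions that were found
    n_reference:  all reference standard interventions
    """
    precision = tp_groups / n_groups if n_groups else 0.0
    recall = tp_reference / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean weights recall and precision equally.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(precision_recall_f1(6, 10, 6, 12))
```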
For the four disease topic/prevention and control pairings, only the output summarized by the Combo algorithm was compared to the reference standard, since there is no conventional schema for prevention. In addition to predications in the form “Intervention X_PREVENTS_disease Y,” other predications whose argument concepts included prevention terms, such as “Exercise, aerobic_AFFECTS_blood pressure” and “Primary Prevention_USES_Metformin,” were used, because their value was confirmed in a previous study [41].
We evaluated each baseline by comparing its terms to those of its associated reference standard. If a term in a baseline matched an intervention in the relevant reference standard, the baseline term received a true positive status. We also assigned true positive status to less specific baseline terms if they could logically be associated with related reference standard interventions. For example, in the baseline for pneumococcal pneumonia prevention the term “Polyvalent pneumococcal vaccine” was counted as a true positive, even though it did not identify a specific polyvalent pneumococcal vaccine that was in the reference standard.