|Home | About | Journals | Submit | Contact Us | Français|
Structured abstracts contain distinct labeled sections (e.g., “RESULTS”). The MEDLINE/PubMed database incorporates English-language abstracts that appear in the journals that the US National Library of Medicine (NLM) indexes. If English-language structured abstracts appear in a journal that is indexed, the labels in these abstracts usually appear in all uppercase letters, generally followed by a colon, in MEDLINE/PubMed citations .
Several years after formats for more informative abstracts were proposed [2–,5], NLM studied the structured abstracts that appeared in MEDLINE from 1989–1991 as an initial step in exploring their utility in enhancing bibliographic retrieval . This early study showed that structured abstracts were an emerging, but rapidly growing phenomenon; that MEDLINE records with structured abstracts tended to have more access points (Medical Subject Headings [MeSH] terms and text words) than MEDLINE records as a whole; and that there was significant variation in the structured abstract formats that different journals prescribed.
Implementation of structured abstracts by biomedical journals has been examined on a small scale in the clinical medicine domain [7, 8], but no large-scale examination across all of MEDLINE has occurred since the first exploratory study by NLM. Hence, the objective of this study was to conduct a retrospective cohort study to measure and characterize the growth in structured abstracts in MEDLINE since 1991, with a view, again, toward exploring their utility in enhancing information display and retrieval.
NLM produces annual, static versions of the MEDLINE data (MEDLINE/PubMed Baseline Repository) that are freely available for use by any researcher . The 2007 MEDLINE baseline dataset  was used in this study. Fifteen years of MEDLINE records (7,163,494 records) with record completion dates from November 25, 1991, to November 14, 2006, were extracted from this baseline dataset. Two types of records that had been added to MEDLINE/PubMed during the relevant time period were then excluded from the final research dataset because they were not comparable to records used in the 1989–1991 study: (1) PubMed records that had not received MEDLINE indexing and (2) recently digitized records for pre-1966 Index Medicus citations. Then, all records without abstracts were excluded, leaving a MEDLINE subset of 5,483,473 records. Because the NLM indexing year ended in mid-November, records for articles with 2006 publication dates that were processed by NLM after November 14, 2006, were not part of this subset.
The 5,483,473 MEDLINE records with abstracts were used to develop and validate a new algorithm for identifying structured abstracts:
The algorithm used in the 1989–1991 study was applied to the 1992–2006 set of MEDLINE records with abstracts. However, the twenty-seven labels used in the earlier study were known to be insufficient to retrieve all the variations in structured abstract formats that had appeared since 1991. An additional set of sixty-six structured abstract labels had been identified based on label analyses done on some smaller MEDLINE subsets, information provided by a few experts in the field, and serendipitous discoveries of labels in abstracts in individual MEDLINE citations.
To identify a more complete set of labels for use in isolating structured abstracts, a new label discovery algorithm was developed and validated. The discovery algorithm examined each abstract and identified all uppercase strings that (1) appeared in the first character word position of the abstracts, (2) were followed by either a colon or period (e.g., “PRACTICES AND PATIENTS:”), and (3) occurred in an abstract that also contained at least 1 of the following strings: “RESULT:,” “RESULTS:,” “CONCLUSION:,” or “CONCLUSIONS:.” An additional 458 labels, not present in the original list of 27 or in the supplemental list of 66, were identified by this algorithm.
Knowledge gained from the discovery algorithm and random samples of records showed that a more comprehensive identification of structured abstracts could be achieved through the use of a set of labels that included slight variations, such as mixed-case, spelling variants, plurals, and punctuation variations. A set of 1,335 labels, including 784 minor variants of previously identified labels, was extracted for use in identifying the structured abstracts to be analyzed in this study.
This study ultimately defined a structured abstract as one that contained at least 3 labels that represented 3 distinct concepts (e.g., “INTRODUCTION,” “METHODS,” “CONCLUSIONS”), of which 1 must be considered an ending concept (such as “RESULTS” or “CONCLUSIONS”) from the list of 1,335 labels . This algorithm found 4.6% more structured abstracts than were identified by the 27 labels used in the 1989–1991 study.
This structured abstract definition was applied to the records in the MEDLINE subset, identifying 938,772 records. To confirm that this new structured abstract subset contained only valid structured abstracts, a random sample of 671 records (99% confidence interval; margin of error 5 for a universe size of 938,772 ) was generated and examined. All 671 abstracts conformed to the study's definition of a structured abstract.
The frequency of occurrence of each unique label in the structured abstract subset was computed. Unique labels considered variants of the same concept were then grouped under the most frequently occurring label for that concept, called the metaterm, and the frequencies of all the variants were combined to arrive at an overall total for that concept. The resulting list of metaterms was examined to determine if the metaterms could be logically grouped into a small number of higher-level categories (by small group consensus), such as “METHODS,” as a potential aid to data mining and retrieval applications.
The percentage of new MEDLINE records containing structured abstracts rose from 2.5% for 1992 to 20.3% for 2005 (Figure 1). Because the number of articles indexed each year was also increasing throughout this period, the absolute number of structured abstracts increased even more substantially, from 9,975 (1992) to 118,051 (2005, the last publication year with complete data in the research dataset), more than 1,000.0%.
MEDLINE records with structured abstracts were larger in size than MEDLINE as a whole (7,163,494 records), and they had more access points (Table 1, online only). The average size of a structured abstract record was 758 bytes bigger than the average size of a MEDLINE record. Abstracts from the structured abstract set (938,772 records) had an average of about 57 more words than abstracts from the entire MEDLINE set (5,483,473 records). MEDLINE records with structured abstracts contained an average of 4 labels, but the structured abstract labels alone were not responsible for the larger record sizes. The average title of an article with a structured abstract contained about 2 more words than the average title in MEDLINE records as a whole. Records with structured abstracts had an average of 2.4 more MeSH terms and 1.3 more authors than records in MEDLINE as a whole.
The majority of the 1,335 structured abstract labels identified in the study refer to 1 or more of only 100 concepts. Each of these 100 concepts was assigned its most frequently occurring name as a metaterm. Table 2 (online only) presents the metaterm “PROBLEM” and its 32 label variants as an example.
These 100 metaterms covered 99.9% of more than 4 million structured abstract label occurrences identified in the study. The 100 metaterms were then manually grouped into 5 higher-level categories, called metacategories:
Both the number of individual records with structured abstracts and the number of journals publishing structured abstracts has increased steadily since 1991. Structured abstracts appeared in 20.3% of the 580,583 articles indexed for MEDLINE in 2005, the last full year covered by this study, and the upward trend is continuing (23.0% in 2008). Records with structured abstracts represented 13.1% (938,772/7,163,494) of 1992–2006 MEDLINE as a whole. This is a dramatic increase from 1989–1991, when structured abstracts appeared in only 0.4% (3,873/924,748) of MEDLINE records. A total of 3,166 journal titles contributed structured abstracts from 1992–2006, in comparison to only 78 journals in 1989–1991. More than 1,000 journals have continuously published structured abstracts (starting in one year and continuing through 2005) in contrast to only 10 journals in 1989–1991.
The 1989–1991 study had some indication that the additional MeSH terms assigned to records with structured abstracts were the result of more indexing of patient demographic characteristics, such as sex or age groups. The current study confirms this. For example, in both studies, roughly 60% of records with structured abstracts were assigned the MeSH term “Female,” in comparison to about 30% of MEDLINE records as a whole. Similar substantial differences occurred for the MeSH terms “Male,” “Adult,” and “Middle Age.”
Now that structured abstracts appear in a sizeable fraction of MEDLINE records (nearly a quarter of all abstracts added to MEDLINE in a year), NLM is exploring their utility in enhancing display, assisting the indexing process, and improving information retrieval and discovery. NLM has made two changes as a direct result of this updated research on the occurrence of structured abstracts in MEDLINE/PubMed citations:
The first change improves the readability of structured abstracts in PubMed. The second change makes it possible to improve even further on the readability; for example, a technique such as color-coded labels for the metacategory to which they map could be employed to make it faster for readers to scan for the section of most interest to the search at hand. The second change also makes it possible for NLM to conduct experiments to determine the utility of features such as:
Licensees of NLM MEDLINE data will be able to conduct their own experiments and implement new functionality for users of their applications of the data, too, now that NLM has coded the structured abstracts for the 2011 DTD.
Work is also underway to identify new labels that have appeared since 2006 and to map them to the five metacategories. Detailed information about NLM structured abstract research is available on the Structured Abstracts in MEDLINE site <http://structuredabstracts.nlm.nih.gov>.
The authors are grateful to former NLM associate fellow, Amy McNeely, bookshare librarian, Benetech Initiative, Palo Alto, California, for her careful analysis that did the initial mapping of the structured abstract segment labels found in MEDLINE records into metaterms and metacategories and for her assistance in validating the structured abstract algorithm.
This research is supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.