To assess the capacity of textword queries to provide a comprehensive listing of articles on injury prevention and safety promotion (IPSP) concepts in a literature database.
All terms used to search SafetyLit (a database of scholarly literature selected for its relevance to the IPSP field) during the years 2000–2005 were listed and then examined to identify terms that are synonyms for the same concept. Terms were grouped by concept, the numbers of queries that used terms within each concept category were summed, and the concepts were then ordered by the total number of searches for each concept category. For each textword, the proportion of all articles for that concept that could be found by using it alone was calculated.
Each of the 25 most searched‐for concepts has 4 to 40 synonyms. Sixteen of the concepts required queries using two or more terms to find 75% of the available articles. Few searchers used a sufficient number of textword synonyms in their queries to return a complete listing of the available material.
On the basis of this study, queries using only one or two textword terms are insufficiently sensitive to find all relevant journal articles about an IPSP concept.
A literature search for articles on concepts in the injury prevention and safety promotion (IPSP) field may seem to be a straightforward process. However, conducting a focused and comprehensive search can be time consuming and difficult even if the searcher is a librarian with specialized training and experience. Many scientists and practitioners are unaware of their own lack of knowledge about the information sources and searching mechanisms they use.1 This may lead them to conduct their own searches when it is not convenient to consult a librarian.
When performing a query of a literature database, the searcher may use a combination of “textwords” and “descriptor terms” to find items that are of interest. A textword search (sometimes called a “free‐text search”) will find words or phrases from the article's title or abstract that exactly match what is entered in the query. A descriptor search involves entering a term selected from the thesaurus that is associated with the particular database.
Although the author of a manuscript may suggest keywords (a generic term for the important issues covered within an article) and some journals publish these with the article, the descriptors that are associated with the article when it is added to a database are assigned by personnel known as indexers. The indexer is limited to the specific vocabulary of the thesaurus developed for the particular database and must assign descriptor terms according to a selection protocol relevant to the purpose of the specific database and its target users. Although author‐suggested keyword terms may be used as a guide, the indexer, drawing from the thesaurus vocabulary specific to the database, has final authority over what descriptors are assigned.2,3
The US National Library of Medicine's PubMed4 literature database is widely used by medicine and public health researchers because it includes many journals, is available at no cost, and is easily accessible to anyone with access to the internet. Despite these advantages, and despite recommendations in instructions to authors to use the PubMed thesaurus vocabulary,5,6 its usefulness for IPSP work is limited.7 This is primarily because PubMed is designed to be a database for biomedical concepts, whereas the IPSP field includes concepts from many diverse disciplines. Similarly, the literature databases for other disciplines (for example, Ei Compendex,8 used for scholarly literature in the engineering disciplines) focus on their own specialties. Continuing with the engineering example, even if an article contains information likely to be of interest to readers interested in an education, human behavior, or law enforcement approach to solving an IPSP problem, the indexers using the Ei Thesaurus9 to catalog articles in the Ei Compendex database may not assign terms that facilitate finding it, because the primary purpose of the database is the storage and retrieval of articles of interest to engineers. An article will probably have different index terms in each database that lists it.
Although some of the general topics of interest to the IPSP researcher or practitioner are covered by descriptors in PubMed and other mainstream databases (for example, EMBASE10 and PsycINFO11), many important concepts are not. When descriptors exist for IPSP concepts, they are often for broad categories and lack the specificity for a focused search on an IPSP topic. In contrast, the EMTREE thesaurus for EMBASE and the Medical Subject Heading (MeSH) system used for PubMed have very detailed descriptor listings that name hundreds of bacteria and viruses, but the system of descriptors for each of these databases omits terms for concepts vital to injury prevention. For example, the MeSH system allows the concepts “automobiles” or “motorcycles” to be indexed with specificity, but articles related to trucks (lorries) are indexed to a broader category that includes buses and other engine‐driven conveyances. To obtain information about vehicles such as minivans, sport‐utility vehicles, pickup trucks, articulated trucks, and multi‐trailer trucks, the typical searcher would need to use a textword search or scan results of a more general MeSH search that would produce many irrelevant items. A more sophisticated search is possible by combining multiple MeSH terms with textwords using search‐term tags and Boolean operators, but this type of search requires training and experience to be worthwhile. Even when such a query is performed, however, the searcher must have knowledge of all of the appropriate textwords to use in his or her search for it to be comprehensive. This poses a particular problem for the IPSP field. Research relevant to IPSP is conducted by scientists from at least 30 distinct disciplines,12 and authors from each of those use terminology specific to their field.13,14 There is no standardized set of terms within the IPSP field to describe important concepts.15
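The kind of combined query described above can be sketched as follows. This is only an illustration, not a query used in this study: the field tags `[MeSH Terms]` and `[tiab]` (title/abstract) are PubMed search tags, but the helper `build_query` and the particular terms chosen are assumptions made for the example.

```python
def build_query(mesh_terms, textwords):
    """Combine MeSH descriptors and free-text synonyms with Boolean operators.

    Hypothetical helper: descriptors are ORed together, textwords are ORed
    together, and the two groups are joined with AND.
    """
    mesh_part = " OR ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)
    text_part = " OR ".join(f'"{word}"[tiab]' for word in textwords)
    return f"({mesh_part}) AND ({text_part})"


# Illustrative terms only; a comprehensive search would need every synonym.
query = build_query(
    ["Motor Vehicles"],
    ["minivan", "sport utility vehicle", "pickup truck"],
)
print(query)
```

The sketch makes the limitation plain: the query is only as complete as the list of textwords supplied to it.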
In 1999, a searchable bibliographic archive was established for SafetyLit, a database designed to bring together research articles from many of the disciplines relevant to IPSP,16 and a system was included to allow users to conduct textword searches. As of 15 December 2006, the archive contains more than 55000 articles. Articles are selected for SafetyLit from the English language contents of over 2600 scholarly journals. The SafetyLit website contains a listing of these journals and a detailed description of the article inclusion and exclusion criteria.17,18 Although material in SafetyLit is not yet indexed with terms from its own thesaurus, its growing comprehensiveness and availability at no cost have made SafetyLit an attractive resource for those seeking IPSP literature. This also makes SafetyLit a useful tool for examining the search behavior of those who use the database query system to find literature.
The purpose of this study was to assess the capacity of textword queries to provide a comprehensive listing of articles on IPSP concepts.
Terms and phrases used to search the SafetyLit article database during the years 2000–2005 were imported from visitor search logs into a list maintained using TextPad software (Helios Software Solutions, Longridge, UK). This list was sorted alphabetically to facilitate the detection of duplicate terms and the detection and elimination of obvious keyboard entry errors. Morphological variants (programme–program, tyre–tire) and, following a recent classification and retrieval model,19 common misspellings (Jacuzzi–Jaccuzzi) were retained. Author names and geopolitical labels were removed.
To identify morphological variants and terms that are synonyms for the same concept, the list was manually examined and search terms were grouped by concept. Each term was printed on a separate slip of paper, and the term‐slips were then physically grouped, through an iterative process, on several large tables that had been placed around the perimeter of a room. The manual grouping was repeated until, through a consensus process involving three people, the terms were assessed to be optimally clustered. Two mechanisms were used to confirm that a search word was placed correctly into a concept‐synonym group and to determine whether any terms might apply to more than one concept: (1) specialty dictionaries, glossaries, and thesauri (see Appendix A) were examined; and (2) textword searches were conducted for the years 1975–2004 using each term in several electronic databases of English language literature to identify the concept associated with each of the search words. These databases included: BEI,20 C2‐SPECTR,21 CINAHL,22 Dissertation Abstracts,23 Criminal Justice Abstracts,24 EMBASE,10 ERIC,25 Highway Vehicles Safety Database,26 ISI Web of Knowledge (including Science Citation Index and Social Science Citation Index),27 ITRD,28 Medline/PubMed,4 PsycINFO,11 and TRIS.29 Terms were also entered into the Google30 search engine.
Concept groupings were aggregated at the level of specificity indicated by the searcher. For example, search terms for the broad category “sports injuries/sports safety” were grouped together, whereas searches for injuries related to specific sports (for example, baseball injuries/baseball safety, rugby injuries/rugger injuries, skiing injuries/ski safety) were grouped separately from one another and from the broader category of sports in general.
After the concept‐synonym groups had been determined, the count of each query term within each of the top 25 concept categories was determined by searching the original (pre‐distillation) term list. The term counts within each concept‐synonym group were then summed, and the concepts were ordered by the total number of searches for each concept.
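The counting and ordering step can be illustrated with a minimal sketch. The concept groups and the query list below are invented for the example and are not data from this study; `rank_concepts` is a hypothetical name.

```python
from collections import Counter

# Hypothetical concept-synonym groups (illustrative only).
concept_groups = {
    "child passenger restraint": ["safety seat", "car seat", "child restraint"],
    "drowning": ["drowning", "submersion"],
}

# Hypothetical raw query list, one entry per search, duplicates retained.
raw_queries = [
    "car seat", "drowning", "car seat", "safety seat",
    "submersion", "drowning", "drowning",
]


def rank_concepts(groups, queries):
    """Sum the query counts of each concept's synonyms, then order by total."""
    term_counts = Counter(q.strip().lower() for q in queries)
    totals = {
        concept: sum(term_counts[term] for term in terms)
        for concept, terms in groups.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_concepts(concept_groups, raw_queries)
print(ranking)
```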
To determine the proportion of all relevant articles that could be found using a single textword query, the following tasks were performed. For each of the top 25 concepts, all the textwords within each term group were individually entered into the SafetyLit database. The listings of articles were combined, and an unduplicated list of articles concerning each concept was produced. The number of articles in the unduplicated list became the denominator used to calculate the proportion of articles that could be found using each individual textword search. To reduce any confusion created by the constantly increasing number of articles in the SafetyLit database, the proportion of all articles for that concept that could be found using each textword was calculated as though the user query had been conducted on 1 July 2006.
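The calculation described above amounts to dividing each textword's result count by the size of the unduplicated union of all results for the concept. A minimal sketch, assuming each textword's results are available as a set of article identifiers (the identifiers and terms below are invented):

```python
def textword_proportions(results_by_term):
    """Return (denominator, {term: share of the unduplicated list it finds alone})."""
    # The denominator is the unduplicated list of articles found by any synonym.
    unduplicated = set().union(*results_by_term.values())
    denominator = len(unduplicated)
    if denominator == 0:
        return 0, {term: 0.0 for term in results_by_term}
    shares = {
        term: len(article_ids) / denominator
        for term, article_ids in results_by_term.items()
    }
    return denominator, shares


# Hypothetical article identifiers returned by each textword.
results = {
    "safety seat": {1, 2, 3, 4},
    "car seat": {3, 4, 5},
    "child restraint": {6},
}
denominator, shares = textword_proportions(results)
print(denominator, shares)
```

Because the denominator contains only articles found by at least one listed synonym, any relevant article that none of the synonyms retrieves is absent from it.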
To assess whether searchers were repeating their queries using several textword synonyms, the server logs for the first two weeks of July 2006 were examined by hand to identify every appearance of each of the textwords for the 25 most searched‐for concepts. For each textword used in a query, server logs were examined for the 90 min before and after the query to determine the number of times any synonym of the term was used. If multiple synonyms were used, the IP addresses of the computers used to perform the searches were compared to identify searchers who conducted multiple queries on the same concept.
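In this study the logs were examined by hand, but the same check could be automated. The sketch below assumes a log of (timestamp, IP address, term) entries; the entries, the addresses, and the function name `synonyms_near` are all hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=90)  # 90 min before and after each query


def synonyms_near(log, synonyms, index):
    """Return the synonyms used within the window around log[index].

    The first set counts any searcher; the second counts only queries
    from the same IP address as the query being examined.
    """
    when, ip, _ = log[index]
    any_ip, same_ip = set(), set()
    for other_when, other_ip, other_term in log:
        if other_term in synonyms and abs(other_when - when) <= WINDOW:
            any_ip.add(other_term)
            if other_ip == ip:
                same_ip.add(other_term)
    return any_ip, same_ip


# Hypothetical log entries (documentation-range IP addresses).
log = [
    (datetime(2006, 7, 3, 10, 0), "192.0.2.1", "car seat"),
    (datetime(2006, 7, 3, 10, 30), "192.0.2.1", "safety seat"),
    (datetime(2006, 7, 3, 11, 15), "198.51.100.7", "child restraint"),
    (datetime(2006, 7, 3, 13, 0), "192.0.2.1", "child restraint"),
]
synonyms = {"car seat", "safety seat", "child restraint"}
any_ip, same_ip = synonyms_near(log, synonyms, 0)
print(sorted(any_ip), sorted(same_ip))
```

In this invented example three synonyms appear within the window, but only two come from the same address, which is the distinction the IP comparison is meant to draw.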
The protocol for this study was assessed for ethical issues and accepted by the Institutional Review Board at San Diego State University, San Diego, California, USA.
Of the 603254 terms collected from SafetyLit search logs, 31416 were clearly keyboard entry errors and were eliminated. Author and place names (42574) were removed, and 2061 phrases that were obvious outliers—for example, “persons killed during robberies of elite‐class casinos,” “children crushed while trying to surpass the world record for the height of a human pyramid”—were eliminated. From the remaining 527203 terms, the elimination of duplicate entries left 6634 terms to be examined and subjectively grouped by IPSP‐related concepts and their synonyms.
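The term counts reported above can be checked with simple arithmetic. The figures are taken from the text; the variable names and the derived mean (roughly 79 uses per distinct term) are additions for illustration only.

```python
# Counts reported in the text.
collected = 603254
keyboard_errors = 31416
author_and_place_names = 42574
outlier_phrases = 2061
distinct_terms = 6634

# Terms remaining after the three exclusions (duplicates still included).
remaining = collected - keyboard_errors - author_and_place_names - outlier_phrases
assert remaining == 527203

# Derived for illustration: average number of times each distinct term was used.
mean_uses_per_distinct_term = remaining / distinct_terms
print(remaining, round(mean_uses_per_distinct_term, 1))
```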
Table 1 lists the 5 most frequently queried concepts, their synonyms, and the number of articles returned when a search was performed using each textword synonym. (A table with the 25 most frequently queried concepts is available as supplementary material online at http://ip.bmj.com/supplemental.) Table 1 identifies, for each category, the total number of articles that may be found when all textwords are entered and an unduplicated list created. For each textword, the proportion of the total is provided. For each concept‐textword group, the most frequently used query term is highlighted in bold type. To reduce the size of the table, only those textwords that were used by three or more searchers are included.
Table 1 demonstrates that a query using any single textword will return only a subset of the relevant literature on the concept. For 16 of the 25 listed concepts, searchers would have had to perform two or more queries to find 75% or more of the available material on the concept that could be found if all of the textword synonyms had been used. Table 1 further demonstrates that the number of searches for literature on an IPSP concept has little correspondence to the number of articles on the concept that are available in the literature. This is particularly notable for the query terms “dog‐bite injur(y/ies)”, “teen suicide”, and “home smoke alarm(s)”.
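One way to estimate how many queries are needed to reach the 75% threshold is a greedy selection that repeatedly adds the textword contributing the most new articles. The text does not state how this number was determined in the study, so the approach below is an assumption; a greedy choice is also a heuristic and is not guaranteed to give the smallest possible set of terms. The article identifiers are invented.

```python
def queries_needed(results_by_term, target=0.75):
    """Greedily pick textwords until `target` of the unduplicated list is found."""
    universe = set().union(*results_by_term.values())
    goal = target * len(universe)
    found, used = set(), []
    remaining = dict(results_by_term)
    while len(found) < goal and remaining:
        # Choose the term that adds the most articles not yet found.
        best = max(remaining, key=lambda term: len(remaining[term] - found))
        if not remaining[best] - found:
            break  # no remaining term adds anything new
        found |= remaining.pop(best)
        used.append(best)
    return used


# Hypothetical article identifiers returned by each textword.
results = {
    "safety seat": {1, 2, 3, 4},
    "car seat": {3, 4, 5},
    "child restraint": {6},
}
used = queries_needed(results)
print(used)
```

In this invented example the best single term finds 4 of 6 articles (67%), so a second query is required to pass 75%.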
Although not demonstrated by table 1 (in order to conserve space), the form of the textword entered into a query can have a great effect on the productivity of the search. For example, using a shortened form of a textword will return results that include both the singular and plural forms, but a query that uses the plural form will not find articles that only contain the singular form. This shortening process is termed “truncation” or “stemming” by information storage and retrieval experts. A textword search using “safety seat” will return 118 items, but a query using “safety seats” will find only 79 articles.
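The effect of the singular and plural forms can be reproduced with a small sketch. It assumes that a textword search behaves as a case-insensitive substring match on the title or abstract, which is consistent with the behavior described above but is not a description of the SafetyLit software; the three records are invented.

```python
# Hypothetical title/abstract text keyed by article identifier.
records = {
    1: "Use of a child safety seat reduces injury risk.",
    2: "Misuse of safety seats is common.",
    3: "Booster and safety seat legislation.",
}


def textword_search(texts, phrase):
    """Return identifiers of records containing the phrase (case-insensitive)."""
    needle = phrase.lower()
    return {rid for rid, text in texts.items() if needle in text.lower()}


singular = textword_search(records, "safety seat")
plural = textword_search(records, "safety seats")
print(sorted(singular), sorted(plural))
```

Under this assumption every record found by the plural form is also found by the shortened form, but not the reverse.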
There was evidence that searchers were not routinely using multiple synonyms when seeking journal articles on a concept. During the 2‐week study period, 4792 searches were recorded; a search was conducted approximately every 4.2 min. In no case were four or more synonym terms used in the 180 min time bracket around each textword query. In only seven cases were three different synonym terms found in any 180 min interval. The IP address of the searchers matched in only four of these.
The results of this analysis demonstrate that, on the basis of a single database, queries using only one or two textword terms are insufficiently sensitive to find all relevant journal articles about an IPSP concept. Textword searches are highly particular and focused (that is, a textword search returns only articles that contain the exact word or phrase entered) and thus may return only a limited number of articles that are from a perspective with which the searcher is most familiar. For example, the SafetyLit database contains many more articles about the concept “school violence” than the 48 that were identified using the search term synonyms included in these analyses: the database contains 98 articles from the Journal of School Violence alone. Although many articles included in the database may contain more than one term for the same concept (both because of the lack of a standardized vocabulary and because some authors choose to use multiple terms for the same concept as a matter of literary style), the results from this investigation nonetheless show that most searchers are not obtaining comprehensive results.
This study is one of a series of investigations that may help to shed light on the knowledge, skills, and practices that IPSP researchers and practitioners use when seeking scholarly literature. Among the questions being researched are: What is the search behavior of those working in the IPSP field when they search other literature databases? Do they search themselves or have someone else perform the searches? What databases do IPSP researchers use to find literature? What training in search techniques have they had? What are the key scholarly journals for IPSP research and what bibliographic databases include them? For those who use PubMed and PsycINFO to find IPSP literature, what is the sensitivity and positive predictive value of their queries? What proportion of the English language IPSP literature is captured by SafetyLit and how does this compare with other databases?
One strength of this study is that it provides a real‐world demonstration of the problems with using textword searches to query databases for IPSP literature. The use of actual queries of the SafetyLit database over a 6‐year interval suggests that the article addresses areas of IPSP that are of interest to those who work in the IPSP field.
Its primary weakness is that the unduplicated count of all articles in the database (that could be found if all identified textword synonyms were used) underestimates the total number of articles available on a particular concept. This underestimation deflates the denominator for the proportions listed in table 1 and leads to an overestimation of the percentage of the true number of articles in the database that are available by using each textword listed.
The SafetyLit database limits its contents to articles that are themselves in the English language or that are in another language but with English abstracts. Thus, the textwords used for the searches were also from the English language. The generalizability of these results to other languages or to other English language databases is unknown.
Although searching by textword is usually highly focused, there are cases where a textword search may produce excessive irrelevant information, particularly for words with multiple meanings. For example, entering the textword “mobiles” will find partial listings for cellular telephones, nursery dangling mobiles, baby mobiles (baby walkers) and automobiles; entering the textword phrase “hot water heater” will find articles on tap water temperature, on fires, and on carbon monoxide poisoning. With a database such as SafetyLit where all material has been selected to be relevant to IPSP, this is a lesser problem than with a database designed for other purposes—for example, PubMed—where irrelevant articles could concern a wider variety of topics. Thus, it is probably more productive to use a database containing articles that have been indexed with descriptors, provided that the database has a focus on the IPSP field and the thesaurus was constructed to facilitate finding IPSP material. Until this resource exists, literature database queries should be conducted with care. A tool is under development that may aid a searcher's efforts to find a list of all synonyms necessary to conduct a comprehensive search for articles on an IPSP concept. A draft of the Injury prevention safety promotion thesaurus is available (Appendix A). Although it remains a work in progress, it may prove to be useful as a source of synonym terms even though it is incomplete.
A thorough search of the literature is important. An incomplete literature search may result in a distorted interpretation of the body of research on a topic.31 Decisions that are based on incomplete information are poorly informed and may waste time, work effort, and money, especially if that information is gathered from a few familiar sources using only search terms that are familiar. Poor decision‐making can block interventions needed to prevent injuries, disabilities, and deaths.32 At best, poor decision‐making is likely not only to delay implementation of useful projects but also to diminish the resources available for proper interventions after unsound projects fail.33,34
Entering a word or two into a database search field will return a listing of articles, sometimes a lengthy one. A naive researcher may conclude that this listing is comprehensive when, in fact, it may be grossly incomplete. This report should raise awareness among those who search the literature for IPSP material that retrieving a focused and comprehensive listing of relevant articles requires thoughtful planning and thorough investigation of all of the terms that may be applied to the concepts of interest.
I thank Lucie Laflamme for her wise counsel concerning my maintaining a concise focus upon what is necessary, and more importantly, avoiding what is unnecessary. My thanks go also to Pia M Johansson of the Stockholm Center for Public Health, Anne Ainsworth and Nilam Patel of San Diego State University, and Kristi Passaro for their useful suggestions and editing to help clarify the text.
IPSP - injury prevention and safety promotion
MeSH - medical subject heading
The 12‐page “List of glossaries and thesauri consulted” is available online at http://www.injujrypreventionthesaurus.com. A draft of the thesaurus (work in progress) is also available there.
Competing interests: DWL is the editor of SafetyLit. SafetyLit is currently supported through contracts with several government agencies of the State of California, USA. This research, however, was self‐supported. It was not supported by those government contracts. The SafetyLit electronic mailing list and website contain no advertising. SafetyLit is a free service.