To characterize PubMed usage over a typical day and compare it to previous studies of user behavior on Web search engines.
We performed a lexical and semantic analysis of 2,689,166 queries issued on PubMed over 24 consecutive hours on a typical day.
We measured the number of queries, number of distinct users, queries per user, terms per query, common terms, Boolean operator use, common phrases, result set size, and MeSH categories; used semantic measurements to group queries into sessions; and studied the addition and removal of terms from consecutive queries to gauge search strategies.
The size of the result sets from a sample of queries showed a bimodal distribution, with peaks at approximately 3 and 100 results, suggesting that a large group of queries was tightly focused and another was broad. Like Web search engine sessions, most PubMed sessions consisted of a single query. However, PubMed queries contained more terms.
PubMed’s usage profile should be considered when educating users, building user interfaces, and developing future biomedical information retrieval systems.
PubMed is an interface to MEDLINE, the largest biomedical literature database in the world. The United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH), publishes general usage statistics, 1 but not detailed query information. Information Retrieval (IR) researchers use log analyses 2–4 to understand user behavior such as typical query length and complexity, 5,6 how many results users look at, 7 and use of Boolean operators. 8 These data provide insight into system performance, and inform user interface design and user education. The goal of this study was to obtain similar insight into PubMed usage for the IR community, and for providers of biomedical search systems, educators, and the general public.
Query logs are usually derived from server logs and contain queries issued by users. Queries are traditionally grouped into sessions, which are series of related queries issued by the same user. Analyses of Web search engine query logs form the foundation of what we know about user searching behaviors on the Web.
In a 1998 study focused on AltaVista, Silverstein et al. found that most users issued simple Web queries of three or fewer terms, used operators in approximately 20% of cases, and rarely went beyond the first page of results. 2 Similarly, Jansen et al. studied Excite and found that 66% of users issued only one query and those Web queries were usually short. Users were equally likely to narrow the query by adding terms, or broaden by removing terms, during a single session. 4 As in the AltaVista study, few users clicked on results after the first page, although they reviewed some results after the first page. 4,9 Chau et al. analyzed the query log of a Utah government site search engine and found that this special-purpose search engine had a different usage profile than general purpose engines. 10 Thus, PubMed may have a different usage profile than general Web search engines.
The NLM estimated in 2002 (the last year for which we could obtain this information) that one third of PubMed’s users were members of the general public, while the remaining two thirds were health care professionals and researchers. 11 MEDLINE users leveraged its unique features 12 and studies show that experienced MEDLINE users such as medical librarians perform searches with higher recall and precision than novice clinicians or members of the general public. 13,14 PubMed users may employ different search strategies than Web search engine users.
Three kinds of queries have been characterized according to their underlying intent. 15 “Informational queries” are intended to satisfy information needs on a topic. For example, a user may search for “myocardial infarction.” In contrast, “navigational queries” are intended to retrieve a specific document or set of documents. For example, the query “j am med inform assoc [journal] AND 2006 [dp] AND 96 [pg]” intends to retrieve a specific article. When users issue “transactional queries,” they are searching to perform Web-mediated activities such as shopping or banking. Transactional queries do not have a direct PubMed equivalent.
The distinction between informational and navigational queries reflects the distinction between IR and database access. Whereas IR focuses on access to relatively unstructured data (e.g., free text), database management systems provide access to highly structured data. Therefore, identifying which records to return is a critical issue in IR. In contrast, compact storage and efficient retrieval are critical database issues. If PubMed users issue primarily navigational queries, then researchers should focus on optimizing database access. However, if informational queries are common, then IR issues must be addressed.
The goal of this study was to understand PubMed usage. Specifically, we were interested in the length of a typical query/session, the size of the result sets, use of Boolean operators, whether queries were informational or navigational, common search topics, and search strategies.
We obtained a single day’s query log from PubMed, anonymized by the NLM to protect user privacy. The file is publicly available at ftp://ftp.ncbi.nlm.nih.gov/toolbox/pubmed/query-logs, dated October 17, 2005, and is described as “a typical day,” but the date of collection is not provided for confidentiality reasons. The file includes: user ID (scrambled), timestamp (seconds since midnight EST), and the query string as entered by the user. According to the NLM, the user ID is provided “so that multiple queries from the same user can be matched” 16 and does not rely solely on the IP address or cookie on the user’s computer. In contrast to previous log analyses, our data did not include the entire server log. Specifically, we did not have access to results returned in response to the query, user selections (clicks) or page views.
The log file contained 2,996,301 queries issued by 627,455 different users. Each query represents a single “question” submitted by the user to PubMed. Thus, a session consists of one or more queries (). The log file included all queries issued over 24 hours, from midnight to midnight. We arbitrarily excluded very prolific users (over 50 queries/24 hours), since they could represent institutional proxies or programmatic searchers (“bots”) rather than individuals. This represented 2,941 users (0.47%) who issued a total of 307,135 queries. We analyzed the remaining 2,689,166 queries. From here on, we refer to this as our entire dataset.
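The exclusion of prolific users can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log is represented as a list of `(user_id, timestamp, query)` tuples mirroring the three fields described above, and the function name `drop_prolific_users` is hypothetical rather than taken from the study.

```python
from collections import Counter


def drop_prolific_users(log, max_queries=50):
    """Drop every query from users who issued more than max_queries queries.

    `log` is a list of (user_id, timestamp, query) tuples. The tuple layout
    is an assumption made for this sketch, not the actual file format.
    """
    # Count how many queries each (scrambled) user ID issued during the day.
    per_user = Counter(user_id for user_id, _, _ in log)
    # Keep only the rows whose user stayed at or below the cutoff.
    return [row for row in log if per_user[row[0]] <= max_queries]
```

Applied to the full log, a filter of this kind would leave only the queries from users with 50 or fewer queries in the 24-hour period.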
We first took a sample (referred to as “query sample” from now on) of the log file. We used a random number generator to include each query from the log file with a 0.001 probability, which gave us a sample of 2,708 queries. To retrieve result counts and PubMed’s Medical Subject Headings (MeSH) translation for each query in the sample, we submitted them to PubMed via the E-Utilities interface (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html, accessed January 20, 2006). We computed the mean, median, and standard deviation for the number of results per query. We used the MeSH translation to classify queries as informational or navigational. Queries that contained only bibliographic tags (e.g., [pdat], [au]) were deemed navigational, according to the algorithm shown in . In other words, queries were considered informational by default, and were counted as navigational only if positively identified as such. We also compared the number of results retrieved by navigational and informational queries. To verify our query-classification algorithm, we classified the same sample manually. We counted queries in which users searched for authors’ names exclusively, citation information (like journal name, date of publication, and page numbers) exclusively, or explicit MeSH Terms. If the query was not explicit but the intent was clear (for example, “Smith, AB” is not mapped to an author tag) we classified it according to its intent. We assumed that all other queries were simply textual searches. Thus, navigational queries should be equivalent to the manually classified author and citation queries.
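The default-informational rule described above can be sketched as follows. This is a minimal illustration, not the study's algorithm: it assumes the query (or its PubMed translation) is available as a string with field tags in square brackets, and the set `BIBLIOGRAPHIC_TAGS` is a short, hypothetical list chosen for the example rather than a definitive inventory of PubMed tags.

```python
import re

# Illustrative, non-exhaustive set of bibliographic field tags (an assumption).
BIBLIOGRAPHIC_TAGS = {"au", "author", "pdat", "dp", "journal", "ta", "pg", "vi", "ip"}

# Captures the text between square brackets, e.g. "au" from "smith[au]".
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")


def classify_query(query):
    """Return 'navigational' only if every field tag is bibliographic.

    Queries without any tag, or with at least one non-bibliographic tag,
    fall back to 'informational', matching the default described in the text.
    """
    tags = [tag.strip().lower() for tag in TAG_PATTERN.findall(query)]
    if tags and all(tag in BIBLIOGRAPHIC_TAGS for tag in tags):
        return "navigational"
    return "informational"
```

With these assumptions, the citation-style example from the introduction ("j am med inform assoc [journal] AND 2006 [dp] AND 96 [pg]") is labeled navigational, while "myocardial infarction" is labeled informational.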
We performed a term analysis to find the most common words and phrases users entered into PubMed. We lowercased each query to eliminate the effect of case. A term was defined as a string separated from others by punctuation, white space, or a string of characters contained within square or curly brackets, or quotes. For example, [MeSH Major Headings] was a single term. We determined the most common terms by counting every occurrence of each distinct term in the query log (excluding single letters and punctuation) and sorted in descending order. We reported both the most common terms and the most common search field tags. We were unable to group equivalent tags (e.g., [author]=[au]) because we were not able to obtain a definitive list of equivalents from the NLM.
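One way to approximate this term definition is with a regular expression, as in the sketch below. It is a simplified reading of the definition: bracketed, braced, and straight-quoted strings are kept as single terms, everything else is split on punctuation and white space, and all single-character terms are dropped (slightly broader than "single letters"). The names `extract_terms` and `count_terms` are hypothetical.

```python
import re
from collections import Counter

# A term is a bracketed, braced, or double-quoted string, or else a run of
# letters and digits. Typographic (curly) quotes are not handled here.
TERM_PATTERN = re.compile(r'\[[^\]]*\]|\{[^}]*\}|"[^"]*"|[^\W_]+')


def extract_terms(query):
    """Lowercase the query and split it into terms, dropping 1-character terms."""
    terms = TERM_PATTERN.findall(query.lower())
    return [term for term in terms if len(term) > 1]


def count_terms(queries):
    """Count every occurrence of each distinct term across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(extract_terms(query))
    return counts
```

Calling `count_terms(queries).most_common(50)` would then give a ranked list of terms, with field tags such as `[au]` counted alongside ordinary words.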
We then performed a second order analysis similar to the one described in Silverstein et al., 1998 2 that detects two-term correlations, regardless of their relative positions in the query (i.e., terms did not have to be adjacent to each other). We computed a correlation coefficient ρ to judge the strength of the relationship between terms in each pair. For Boolean data (occurrence/non-occurrence), ρ is related to χ², a statistical measure of deviation from expected frequencies, by the formula χ² = n × ρ². For example, if the terms “gastric” and “cancer” appeared frequently in the same query, but not independently, they had a high correlation. This analysis required quadratic storage space. Thus, we arbitrarily considered only pairs of the 25,000 most common terms. The list was filtered using correlation and frequency cutoffs. Correlations that contained stopwords or bibliographic tags were considered uninteresting and were removed manually.
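For two Boolean variables, the correlation coefficient is the standard phi coefficient, which can be computed from four counts. The sketch below assumes that is the quantity meant by ρ; the function and parameter names are hypothetical, with `n` the total number of queries, `count_a` and `count_b` the number of queries containing each term, and `count_both` the number containing both.

```python
import math


def term_pair_correlation(n, count_a, count_b, count_both):
    """Phi coefficient between the occurrence indicators of two terms."""
    denominator = math.sqrt(
        count_a * (n - count_a) * count_b * (n - count_b)
    )
    if denominator == 0:
        # A term present in every query or in none carries no information.
        return 0.0
    return (n * count_both - count_a * count_b) / denominator


def chi_squared(n, rho):
    """Chi-squared statistic implied by rho for a 2x2 occurrence table."""
    return n * rho ** 2
```

Two terms that always appear together and never apart give a coefficient of 1, while terms that co-occur exactly as often as chance predicts give 0.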
To determine general search topics, we mapped all queries to MeSH (2005 edition) using the NLM’s Metamap batch server (http://skr.nlm.nih.gov/) with the default processing options, plus –M “MMI output.” We weighted each MeSH term according to the number of mappings for that particular query. The sum of scores for each query was one. For example, if a query was mapped to three MeSH terms, each term was weighted by 1/3. We used the MeSH hierarchy to categorize terms into its top level. We drilled down into the “disease” category (second level) to determine the most popular clinical topics. When a term was classified into multiple categories, we counted its contribution to all categories.
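The weighting scheme can be illustrated with a short sketch. It assumes each query has already been mapped to a list of MeSH terms and that a lookup from term to top-level categories is available; both inputs, and the name `category_scores`, are hypothetical stand-ins for the MetaMap output and the MeSH hierarchy.

```python
from collections import defaultdict


def category_scores(query_mappings, categories_of):
    """Sum per-category weights, giving each query a total term weight of one.

    query_mappings: list of lists of MeSH terms, one inner list per query.
    categories_of: dict from MeSH term to the categories it belongs to.
    """
    scores = defaultdict(float)
    for terms in query_mappings:
        if not terms:
            continue  # unmapped queries contribute nothing
        weight = 1.0 / len(terms)  # e.g. three terms -> 1/3 each
        for term in terms:
            # A term in several categories contributes to all of them.
            for category in categories_of.get(term, ()):
                scores[category] += weight
    return dict(scores)
```

Note that because a term in multiple categories is counted in each, the category totals can add up to more than the number of mapped queries.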
To gain insight into users’ search strategies, we grouped related queries. PubMed’s user hash identifies all queries by the same user, but does not consider their time or topic. For example, a user’s query history could include queries for “myocardial infarction AND aspirin” at 11:13 AM, “myocardial infarction prevention AND aspirin” at 11:37 AM, and “gastrointestinal stromal tumors” at 12:00 PM. In this hypothetical example, the user made her first query more specific, probably to retrieve fewer, more relevant results, and then switched topics completely. To perform automated analyses, we must be able to group related queries into sessions. This was traditionally done using time thresholds, i.e., if the user waited more than a certain number of minutes, then a new session began. Since there are no prior PubMed log analyses using sessions, and PubMed users might be different from general Web searchers, separating sessions by time may be overly simplistic. Therefore, we performed a semantic analysis over the entire data set to separate users’ queries into sessions.
We relied on detecting a change in topic by evaluating the semantic distance between consecutive queries. Semantic distance reflects difference in the meaning of two concepts. For example, “dog” is closer to “cat” (as they are both mammals) than to “pterodactyl,” so the semantic distance between “dog” and “cat” is smaller than between “dog” and “pterodactyl.” By measuring the semantic distance between queries, we expected to group them into sessions better than with arbitrary time thresholds.
We evaluated this claim by performing a small pilot study. We printed a random sample of 2,390 queries issued by 351 individual users (as identified by the NLM-provided user hash). Two of the authors (LYT and JRH) independently identified session boundaries. Sessions were defined as sets of queries in which the user was pursuing the same information need. We compared the results of this exercise to dividing the queries into sessions using a time cutoff (0 to 120 minutes in 1 minute increments) and to our MeSH-based semantic classifier. We found that the semantic classifier had better concordance with human judgment than all time cutoffs. We also used these results to determine the best distance threshold between sessions (3.8), which was used for the rest of the analysis.
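The time-cutoff baseline used in this comparison is simple to express in code. The sketch below assumes one user's timestamps are given in seconds since midnight, as in the log file; the function name `split_sessions_by_time` is hypothetical, and whether a gap exactly equal to the cutoff starts a new session is an arbitrary choice made here.

```python
def split_sessions_by_time(timestamps, cutoff_minutes):
    """Group one user's query timestamps into sessions by elapsed time.

    A new session starts whenever the gap since the previous query
    exceeds cutoff_minutes.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= cutoff_minutes * 60:
            sessions[-1].append(t)  # close enough in time: same session
        else:
            sessions.append([t])  # first query, or gap too long
    return sessions
```

For the hypothetical user described earlier (queries at 11:13 AM, 11:37 AM, and 12:00 PM), a 30-minute cutoff yields a single session even though the third query changes topic, which illustrates why a purely time-based split can be too coarse.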
We used MeSH mappings for queries and computed semantic distance between consecutive queries. Distance was defined as the shortest path between pairs of concepts on the MeSH tree as shown in . For this analysis, we only used the highest scoring mapping returned by Metamap for each concept. When we could not map queries to MeSH terms or concepts, we used WordNet (http://wordnet.princeton.edu/). In this case, we walked WordNet’s hypernym/hyponym tree to obtain distance measurements directly from the query as entered by the user. We used WordNet 2.0 via the pywordnet Python interface (http://osteele.com/projects/pywordnet/). To simplify implementation, we only used the WordNet noun database.
We assigned weights to each edge according to its depth in the respective tree. We reasoned that “deeper” steps represent less difference than “shallower” ones. For example, in , “myocardial infarction” is closer to “myocardial ischemia” than the latter is to “heart diseases.” The distance score from “heart diseases” to “myocardial ischemia” is thus greater than the one from “myocardial ischemia” to “myocardial infarction.” We used one divided by the depth of the topmost node in a pair as a score: the steps in descending order were worth 1, 0.50, 0.33, 0.25, 0.2, etc. points. Our use of a depth-conscious measure has precedents in the literature and, in particular, is similar to the Leacock-Chodorow distance. 17 When one or both of the queries in a pair contained more than one term, we paired each term to its closest counterpart in the other query. We then added the individual distances between pairs of terms to obtain a total distance. When we could not compute a distance, we assumed that the queries were part of the same session.
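The depth-weighted path length can be sketched as below. Several details are assumptions made for illustration: each concept is represented by a tuple of tree-number components, the depth of a node is the number of components (so the topmost node has depth 1), an edge is weighted by one over the depth of its upper node, and concepts with no common ancestor are treated as infinitely far apart. The tree numbers in the usage note are illustrative and should be checked against the MeSH edition in use.

```python
import math


def tree_distance(a, b):
    """Depth-weighted shortest path between two nodes of the same tree.

    a, b: tuples of tree-number components, e.g. ("C14", "280", "647").
    """
    # Length of the shared prefix = depth of the lowest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    if common == 0:
        # Assumption: different top-level branches are not comparable.
        return math.inf
    # Climb from a to the common ancestor, then descend to b. An edge whose
    # upper node sits at depth d costs 1/d, so deeper steps cost less.
    up = sum(1.0 / depth for depth in range(common, len(a)))
    down = sum(1.0 / depth for depth in range(common, len(b)))
    return up + down
```

With `heart_diseases = ("C14", "280")`, `myocardial_ischemia = ("C14", "280", "647")`, and `myocardial_infarction = ("C14", "280", "647", "500")`, the first step costs 1/2 and the second 1/3, so the deeper pair is closer, as in the example above.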
Once the queries were divided into sessions, we used a smaller random sample (called “strategy sample”) that consisted of approximately 1% of users (6,000 users who issued 25,650 queries) to study search strategy. We eliminated sessions with a single query from the strategy sample, and used the E-Utilities to obtain result counts. We determined whether users looked for broader or more specific result sets by comparing the number of results returned by consecutive queries within the same session. For example, if a user’s first query in a session retrieved 12,500 articles and her second query retrieved 700, we deduced that she narrowed her query.
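The comparison of consecutive result counts can be sketched as a small classifier. This is an illustration only: the labels and the name `classify_strategy` are hypothetical, and the use of strict inequalities (so that equal consecutive counts make a session "mixed") is an assumption.

```python
def classify_strategy(result_counts):
    """Label a session by how its result counts change between queries.

    result_counts: result-set sizes for the session's queries, in order.
    """
    steps = list(zip(result_counts, result_counts[1:]))
    if not steps:
        return "single query"  # nothing to compare
    if all(later > earlier for earlier, later in steps):
        return "broadening"  # monotonically increasing result counts
    if all(later < earlier for earlier, later in steps):
        return "narrowing"  # monotonically decreasing result counts
    return "mixed"
```

For the example above, `classify_strategy([12500, 700])` labels the session as narrowing.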
shows basic descriptive statistics. shows the distribution of queries per user. Of the 2,708 queries in the query sample, 436 (16.1%) had no results. The result sets from the remaining 2,272 queries are described in and . Of this sample, 599 queries (22.1%) were classified as navigational. Manual classification of the same dataset showed that approximately one quarter of queries were navigational. Specifically, 22.9% of the queries contained only authors’ names or a PubMed author tag; 2.47% contained citation information, and 0.26% had both author names and other citation information. The remaining three quarters (74.4%) were informational searches, and none used MeSH terms explicitly. Excluding queries with no results, approximately 50% of users issued tightly focused queries that returned ten results or fewer. In this sample, navigational queries had a median of 37 results per query (range 1–133,100) and informational queries had a median of 100 (range 1-4,845,000). The difference is statistically significant (Mann-Whitney test, p < 0.001).
Of 2,689,166 queries, 302,386 used at least one Boolean operator (11.24%). The exact number is difficult to ascertain since, officially, PubMed recognizes only uppercase Boolean terms 18 (). However, it rewrites lowercase Boolean operators to uppercase internally, apparently trying to match the user’s intent. This limitation arises because MeSH terms can contain any word including Boolean operators. For example, “Bone and bones” and “not expressed in choriocarcinoma clone 1, human” are MeSH terms. The query log contained 695,018 unique terms. The most common terms are listed in (PubMed stopwords 18 were removed). The 50 most common PubMed tags are listed in . While term counts varied considerably, the majority of queries had fewer than ten terms () with a median of three terms per query.
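The ambiguity around operator case can be made concrete with a small sketch. It assumes that the operators of interest are AND, OR, and NOT appearing as whole words; the function name `has_boolean_operator` is hypothetical, and the check deliberately ignores whether the word is part of a MeSH term.

```python
import re

# Whole-word matches only, so "android" or "nor" do not count.
_UPPERCASE_ONLY = re.compile(r"\b(?:AND|OR|NOT)\b")
_ANY_CASE = re.compile(r"\b(?:AND|OR|NOT)\b", re.IGNORECASE)


def has_boolean_operator(query, ignore_case=False):
    """Report whether the query contains AND, OR, or NOT as a whole word."""
    pattern = _ANY_CASE if ignore_case else _UPPERCASE_ONLY
    return pattern.search(query) is not None
```

A query such as "bone and bones" is counted only when case is ignored, which is exactly the situation in which a MeSH term can be mistaken for a Boolean expression.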
The second order analysis identified 2,552,940 highly correlated term-pairs. We filtered the list down to a manageable size by arbitrarily keeping all term pairs with a correlation coefficient ρ greater than 0.6, and over 100 occurrences in the query log. This yielded 26 term pairs (). Uninteresting pairs, like those involving stopwords, did not have high correlation. Evidence-based medicine-related term pairs figure prominently in this list (“randomized controlled,” for example, was the most frequent term pair, although it was present in only 0.13% of queries). Other highly correlated terms may be seasonal, such as phrases related to Lyme disease (“burgdorferii garinii”).
MetaMap provided MeSH mappings for 1,495,354 of the queries (55.61%). The most common MeSH categories were “Chemicals and drugs” (24.61%), “Diseases” (20.16%), “Biological sciences” (10.79%), and “Anatomy” (10.27%) (). The subdivisions of the “Diseases” category are shown in . The most common disease category was “Pathological conditions, signs and symptoms,” with 13.03% of all Diseases.
We performed a semantic distance analysis on all queries to divide them into sessions. The majority of users conducted a single search session during the day (). The query log contained 740,215 sessions. Most of these sessions were short (62.75% had a single query) which is similar to Silverstein’s finding that 63.7% of AltaVista sessions consisted of a single query 2 ().
Excluding sessions with one query from the strategy sample left 4,997 sessions. Of these, 23.30% had monotonically increasing result counts, while 23.66% had monotonically decreasing result counts. The rest did not have a consistent strategy. Therefore, users broadened and restricted their searches in roughly equivalent numbers.
We found that PubMed users issued diverse queries on a broad range of topics without dominant phrases. Like Web users, PubMed users favored short queries and issued few queries per session. When they edited consecutive queries in a single session, they were equally likely to broaden or narrow the search. Approximately one quarter (22%–26%) of queries were navigational and three quarters were informational. Advanced MEDLINE features such as MeSH terms were seldom used.
Users issued a median of two queries, although there was large variability in query counts per user. Result set sizes had a bimodal distribution, suggesting that there were two classes of queries. There were focused queries with fewer than ten results and less focused queries with a mode of approximately 100 results. While we do not know whether users actually clicked on these results, the low numbers suggest that they preferred small result sets. Previous studies showed that experienced users were able to search MEDLINE more effectively. 13,14 The bimodal distribution reflects the distinction between navigational and informational queries; it may also be, in part, due to different usage patterns of professional, highly trained users compared to the general public.
PubMed queries had a median of three terms, higher than reported for Excite and AltaVista. Of PubMed queries, 11.2% contained Boolean operators, which was lower than AltaVista (21.4%) 2 and similar to Excite (10.0%). 9 It is possible that PubMed users intended to issue more Boolean queries, but did not type the operators in uppercase. If we disregard case, 21.8% of the queries contained at least one Boolean operator (). Therefore, the proportion of queries intended to contain Boolean operators was somewhere between 11.2% and 21.8%. In contrast to the differences in Boolean operator use, sessions were of similar length, with approximately as many users issuing a single query per session on PubMed as on Web search engines. The high frequency of sexual and pornographic terms in Web query logs was not seen on PubMed.
Lack of clickthrough information was one of our major limitations. We can only speculate about the number of results users actually reviewed. Another limitation is that we can report on only a single typical day’s log, and thus we cannot exclude temporal artifacts. For example, one may wonder if an article on Lyme disease appeared in the press on the same day these data were captured, or if this level of interest in Lyme disease is constant.
Our findings regarding search strategy rely on the accuracy of our session separation algorithm. Our technique is better at grouping related queries than a time cutoff, but it is less straightforward and requires an arbitrary threshold. While we believe that the technique is generalizable, this has not been empirically demonstrated.
In future work, we would like to strengthen and deepen this analysis by including clickthrough data and information on retrieved results, perhaps by using data from local Web proxy logs. In addition to knowing how users search and what they search for, we could determine whether they are successful. For example, a search where the user downloads the full text of an article via a link from PubMed would be considered more successful than a search where the user did not click on any results. Given clickthrough data, algorithms that learn from usage (implicit feedback) can be adapted to PubMed. 19
On the Web, few users review results beyond the first page. 2 If a significant proportion of PubMed users also focus on the first few results, then we need to develop ranking strategies that place the most important results at the top. 20 In future work we plan to leverage this information to improve biomedical IR.
We studied a full day of PubMed queries to characterize user search behavior. We found that PubMed searches resemble Web search engine queries in some respects, such as session length. PubMed users issue a wide variety of queries on a large variety of topics without dominant search terms or topics. The majority of queries were informational. Therefore, developing effective information retrieval strategies remains important. Our findings suggest that educators and PubMed user interface researchers should focus not on specific topics, but on overall efficient use of the system. We also found that result sets come in two sizes, with some very broad queries. We hope that these results inform the design and evaluation of future biomedical information retrieval tools.
Supported in part by a training fellowship from the W. M. Keck Foundation to the Gulf Coast Consortia through the Keck Center for Computational and Structural Biology, NLM grant 5K22LM008306 and NCRR grant 1UL1RR024148.