Authors, editors and reviewers alike use the biomedical literature to identify appropriate journals in which to publish, potential reviewers for papers or grants, and collaborators (or competitors) with similar interests. Traditionally, this process has relied either upon personal expertise and knowledge or upon a somewhat unsystematic and laborious manual search of the literature for trends. To help with these tasks, we report three utilities that parse and summarize the results of an abstract similarity search to find appropriate journals for publication, authors with expertise in a given field, and documents similar to a submitted query. The utilities are based upon a program, eTBLAST, designed to identify similar documents within literature databases such as (but not limited to) MEDLINE. These services are freely accessible through the Internet at http://invention.swmed.edu/etblast/etblast.shtml, where users can upload a file or paste text such as an abstract into the browser interface.
Searching for pertinent literature is an essential part of every scientist's life. There are many stages in the scientific process in which intimate knowledge of the appropriate literature is critical: (i) familiarization with a new area by a young scientist or a scientist whose research is taking a new direction, (ii) monitoring the literature as the research progresses to capitalize on recent developments, measure one's competitiveness and avoid duplication of effort (1), (iii) development of reference lists during manuscript or grant application writing and (iv) compiling suggested reviewers when called upon to do so as part of a manuscript submission to a journal. For mature scientists, the reasons for interaction with the literature expand: (i) development of very broad knowledge when writing, for example, a review article, (ii) mastery of new areas in the role of student mentor or examiner and (iii) acquiring focused knowledge when called upon as a manuscript or grant application reviewer. For other scientific professionals, the literature is a resource for identifying colleagues: (i) identification of experts for advisory or steering committees, (ii) selection of reviewers for grants or proposals by government or private agencies, (iii) identification of experts for legal proceedings and testimony, (iv) finding starting points into the literature for novice or lay individuals by librarians and (v) identification of manuscript reviewers by journal editors.
The primary portal for the biomedical literature is PubMed (2,3). This web-based tool searches the Medline database using keywords and Boolean operators. Choosing appropriate keywords requires some prior knowledge of the field, and it often takes numerous iterations of sampling the literature before the most relevant records are found. Once the results of a query are presented to the user, the lists can be sorted by date, author or journal. Recent research has focused upon improving the quality and navigation of output (4–8).
The Medline database contains sufficient information to overcome these limitations, given a tool with appropriate methods for query entry and result presentation. Scientists and other professionals either generate concentrated information in the form of an abstract or other document in the course of manuscript or grant writing, or are presented with it. Given this, the keyword selection and optimization process can be bypassed if natural-language free text, such as an abstract, can be submitted directly to a literature search engine. To do this, we have developed eTBLAST, which uses a hybrid scheme: it extracts and weights keywords contained within the submitted query to identify a subset of the literature in Medline, and then performs a sentence alignment to compute a final quantitative score as a measure of similarity and, presumably, relevance. The tool then outputs a list similar to that of PubMed, but ranked by this similarity score. At this point, scientists can interact with the most relevant Medline literature much as they have done traditionally in PubMed, by sorting on date, author or journal. This similarity-ranked output can be further processed to compile lists and present output views that add value for the specific uses just outlined: identifying the most frequent and prominent authors as experts or reviewers, identifying the most frequent journals as targets for submission, and inspecting the publication rate over time as a measure of novelty and topic popularity. It should be noted that eTBLAST and PubMed both find similar abstracts, but by different methods; PubMed's Related Links is limited to finding similarity among the records currently in Medline, whereas eTBLAST accepts arbitrary text. There are also numerous other keyword-based Medline search tools (CiteXplore, HubMed and GoPubMed, for example) (8–10), some of which have result post-processors with similar functionality (author and journal finding).
The server requires a text specimen, which can be entered by copy/paste or by uploading a text-only file. Additionally, users can supply an email address to receive a URL pointing to the results. Results are stored for at least 1 month. The analysis is currently performed on a 20 CPU Linux cluster. The eTBLAST web server has been running since 2003. A typical search (of an abstract containing 100–200 words) against Medline, which currently has >16 million records, usually takes 1 to 3 min, and the search time is roughly proportional to the query length. Although Medline is expanding by about 500 000 records per year, eTBLAST performance is continuously being improved through code optimizations and expansion of the number of CPUs in the cluster. A backup 20 CPU Linux cluster mirrors the primary cluster to ensure high availability.
eTBLAST [see (9) for a detailed description of methods and performance statistics] returns a list of PubMed IDs (PMIDs) ordered by statistical similarity to the input text. Briefly, eTBLAST computes a quantitative score in a two-step process. In step one, a weighted keyword set extracted from the query is used to quickly search a database of indexed keywords in Medline, gathering the 400 most similar records. In step two, a novel sentence alignment algorithm is used to refine the rank order of those records and compute a z-score. Each of the utilities presented herein performs a similar set of tasks on these results: (i) the results are parsed to extract relevant articles (those with a similarity z-score >3), (ii) overrepresented authors or journals are identified and scored and (iii) the results are returned to the user (Figure 1A).
On January 17, 2007 at 17:40, the abstract from (13) was submitted to eTBLAST via the web browser at http://invention.swmed.edu/etblast/index.shtml. Results were returned after 120 s. The query text contained 149 words, of which 58 were ‘stop words’. A collage of some of the output web pages is presented in Figure 1, discussed above, to illustrate the output user interface.
Potential reviewers are those who have published frequently in areas highly similar to the query. An author's name may appear on many citations in many different formats, so the last name and first initial are used. An ‘Expertise score’ is computed for each author who appears in any of the Medline records with an eTBLAST similarity z-score >3, and this Expertise score is used to generate a ranked list which is output to the user. The Expertise score for each author is computed as the sum over all records (1 to N) with a z-score >3:
$$\text{Expertise score} = \sum_{i=1}^{N} \left( w_a \times w_p \right)$$
where an arbitrary weight, wa, is assigned based on the author's position on the author list: the senior author (often last or only author on the list) receives a weight of 3, the lead (first) author 2, and contributing authors 1. ‘Corresponding authors,’ perhaps a good indicator of expertise, are not explicitly tagged in PubMed, and therefore cannot be used. Each record is also assigned a weight, wp, which is its similarity score normalized to the query's self-identity similarity score.
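The scoring just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the eTBLAST implementation: it assumes the Expertise score is the sum of the product of the two weights (wa × wp) over the records on which an author appears, and that the senior author is always the last (or only) name on the list. The record layout (`z`, `similarity`, `authors`), the function names and the example hits are all hypothetical.

```python
from collections import defaultdict

Z_THRESHOLD = 3.0  # only records with z-score > 3 contribute (from the text)


def author_key(last_name, first_name):
    """Collapse name variants to last name plus first initial."""
    return f"{last_name.strip()} {first_name.strip()[:1].upper()}"


def position_weight(index, n_authors):
    """w_a: 3 for the senior (last or only) author, 2 for the first, else 1."""
    if index == n_authors - 1:
        return 3
    if index == 0:
        return 2
    return 1


def expertise_scores(records, self_score):
    """Sum w_a * w_p per author over the records with z-score > Z_THRESHOLD.

    Assumption: the score is the sum of products of the two weights.
    """
    scores = defaultdict(float)
    for rec in records:
        if rec["z"] <= Z_THRESHOLD:
            continue
        # w_p: similarity normalized to the query's self-identity score
        w_p = rec["similarity"] / self_score
        n_authors = len(rec["authors"])
        for i, (last, first) in enumerate(rec["authors"]):
            scores[author_key(last, first)] += position_weight(i, n_authors) * w_p
    return dict(scores)


# Made-up hits for illustration only (not real eTBLAST output).
example_hits = [
    {"z": 5.2, "similarity": 50.0,
     "authors": [("Smith", "John"), ("Lee", "Ann"), ("Okafor", "Helen")]},
    {"z": 4.0, "similarity": 25.0, "authors": [("Okafor", "H")]},
    {"z": 2.0, "similarity": 10.0, "authors": [("Smith", "J")]},  # below cutoff
]
example_scores = expertise_scores(example_hits, self_score=100.0)
```

In this made-up example the third record is ignored because its z-score does not exceed 3, and the two name forms of the same senior author are merged under a single last-name-plus-initial key.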
To distinguish between ‘true’ experts and those authors who appear at low frequency within the author lists of the highly similar records, we computed several Expertise score distributions to identify a threshold score. Two sets of queries, each containing 1000 members, were used as input to eTBLAST. The first test set consisted of 1000 records randomly selected from all of Medline. The second test set consisted of 1000 pseudo-random queries, synthesized from keywords randomly picked from Medline using the built-in Perl pseudo-random number generator [as described in (11)], with the same size distribution and word frequency distribution as Medline. The top scoring authors (experts) were recorded and the score frequency distributions are presented in Figure 2. From these distributions, we were able to define an Expertise score threshold of 0.9, above which authors can be considered as having the relevant skills (based on their publication history) to be potential experts. This threshold is marked on the expert list (Figure 1B). Finally, an expert with no publication in the last 10 years is flagged as potentially inactive (retirement, change in focus, death, etc.).
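As an illustration of how the threshold and the inactivity flag might be applied to a scored author, consider the short sketch below. The 0.9 threshold and the 10-year window come from the text; the function name, the use of a strict comparison at the 10-year boundary, and the choice of which publication years to inspect are assumptions.

```python
EXPERTISE_THRESHOLD = 0.9  # Expertise score threshold reported in the text
INACTIVITY_YEARS = 10      # window for the "potentially inactive" flag


def classify_expert(score, publication_years, current_year):
    """Return (is_potential_expert, possibly_inactive) for one author.

    Assumptions: the score must strictly exceed the threshold, and an author
    is flagged when the most recent publication is more than 10 years old.
    """
    is_potential_expert = score > EXPERTISE_THRESHOLD
    most_recent = max(publication_years)
    possibly_inactive = (current_year - most_recent) > INACTIVITY_YEARS
    return is_potential_expert, possibly_inactive
```

For example, an author with a score of 1.2 whose latest publication dates from 1995 would, in 2007, be listed as a potential expert and flagged as possibly inactive.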
The Journal Finder utility parses the user's eTBLAST results in a manner similar to the Expert Finder described above (Figure 1B). For those N records with a z-score >3 that were published in a given journal, a Journal score is computed as follows:
$$\text{Journal score} = \sum_{i=1}^{N} w_p$$
where wp is defined above. The Journal Finder utility outputs the highest scoring journals to the web browser, ranked by Journal score, along with the citations for the similar publications in each journal. A Journal score threshold, computed similarly to the Expertise score threshold and set to 0.1, is also marked on the output. A benchmark of the Journal Finder utility was conducted using a different set of 4230 abstracts randomly selected from Medline. In 33% of cases, Journal Finder ranked the journal in which the abstract was published within the top 10 suggestions.
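A comparable sketch for the journal ranking follows. It assumes the Journal score is the sum of wp over the records from a given journal that pass the z-score cutoff; the record fields, function names and journal names are hypothetical, and only the 0.1 threshold and the z-score cutoff of 3 are taken from the text.

```python
from collections import defaultdict


def journal_scores(records, self_score, z_threshold=3.0):
    """Sum w_p per journal over the records with z-score > z_threshold."""
    scores = defaultdict(float)
    for rec in records:
        if rec["z"] > z_threshold:
            # w_p: similarity normalized to the query's self-identity score
            scores[rec["journal"]] += rec["similarity"] / self_score
    return dict(scores)


def rank_journals(scores, threshold=0.1):
    """Return (journal, score, above_threshold) tuples, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(journal, score, score > threshold) for journal, score in ranked]


# Made-up hits for illustration only (not real eTBLAST output).
example_hits = [
    {"z": 6.0, "similarity": 50.0, "journal": "Journal A"},
    {"z": 4.5, "similarity": 25.0, "journal": "Journal A"},
    {"z": 3.5, "similarity": 5.0, "journal": "Journal B"},
    {"z": 2.0, "similarity": 40.0, "journal": "Journal C"},  # below cutoff
]
example_ranking = rank_journals(journal_scores(example_hits, self_score=100.0))
```

Because the score is a plain sum, a journal that publishes many moderately similar articles can outrank one with a single very similar article, which is consistent with the caveat that publication volume is not accounted for.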
Authors, reviewers and others can also evaluate the research activity within a given research area, as defined in, for example, a manuscript abstract, by the temporal variation of publications found to be similar to the query. For each similar Medline record found by eTBLAST with a z-score >3, the year of publication is parsed from the search results, and a simple count of publications per year is maintained. This count is then normalized to the total number of Medline publications for the corresponding year. The output is a table with the raw counts per year and the normalized counts, in which each raw count is divided by the ratio of the number of Medline publications in that year to the number in 2005 (the basis year). A graphic is provided for the normalized count over the last 20 years (Figure 1D).
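The normalization can be made concrete with a small sketch. It assumes that each raw yearly count is divided by the ratio of that year's Medline total to the 2005 total, which is one reading of the description above; the function name and the yearly totals in the example are invented for illustration and are not real Medline figures.

```python
from collections import Counter


def publication_trend(hit_years, medline_totals, basis_year=2005):
    """Return (year, raw_count, normalized_count) rows in ascending year order.

    hit_years: publication years of the hits with z-score > 3.
    medline_totals: total number of Medline records per year.
    Assumption: normalized = raw / (total_in_year / total_in_basis_year).
    """
    raw_counts = Counter(hit_years)
    basis_total = medline_totals[basis_year]
    rows = []
    for year in sorted(raw_counts):
        normalized = raw_counts[year] * basis_total / medline_totals[year]
        rows.append((year, raw_counts[year], normalized))
    return rows


# Invented yearly totals, chosen only to keep the arithmetic easy to follow.
example_totals = {2003: 250_000, 2004: 500_000, 2005: 1_000_000}
example_rows = publication_trend([2003, 2005, 2005, 2003, 2004], example_totals)
```

With these invented totals, two hits in 2003 count for as much as eight hits would in 2005, so a topic is not scored as growing merely because Medline itself has grown.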
A score threshold (ratio of similarity score to query self-identity score >0.56) has been experimentally defined (unpublished data) and is used to identify and flag any records that are of unusually high similarity to the query, as an aid in determining the novelty of the topic defined by the query. Such records either are abstracts that were themselves taken from Medline for analysis or, if not, serve as a red flag that something very similar to the material being queried has already been published. In our test case, a similarity flag is raised for the paper containing the original abstract (Figure 1A).
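A minimal sketch of this flagging step is given below. It assumes the threshold applies to the similarity score divided by the query's self-identity score (the same normalization used for wp); the record fields, function name and identifiers are hypothetical placeholders.

```python
HIGH_SIMILARITY_RATIO = 0.56  # threshold reported in the text


def flag_high_similarity(records, self_score, ratio=HIGH_SIMILARITY_RATIO):
    """Return the identifiers of records whose normalized similarity exceeds ratio.

    Assumption: normalized similarity = similarity / self-identity score.
    """
    return [rec["id"] for rec in records if rec["similarity"] / self_score > ratio]


# Made-up hits; "record-1" stands in for the query's own Medline entry.
example_hits = [
    {"id": "record-1", "similarity": 100.0},
    {"id": "record-2", "similarity": 60.0},
    {"id": "record-3", "similarity": 20.0},
]
example_flags = flag_high_similarity(example_hits, self_score=100.0)
```

A query taken directly from Medline matches its own record with a ratio of 1, so it is always flagged, as in the test case described above.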
The primary methods by which users interact with the results of Medline searches can be improved and expanded to provide quick and efficient suggestions for optimizing the manuscript writing and publication process, including review. Quantitative similarity scores computed for a text query, such as the abstract of a manuscript submitted to a journal for publication, against the primary biomedical bibliographic database, Medline, can be used to generate a ranked list of similar documents, from which summary information about the authors, journals, similar work and dates can be of high utility. Scientists submitting to, or editors of, the more than 5000 journals represented in Medline can use this free web-based utility to speed the process of selecting or confirming an appropriate journal, to estimate a given article's novelty based on the relative similarity of its abstract and to select potential reviewers (experts), who are typically requested by journals at manuscript submission time.
Several caveats and potential enhancements to the system should be noted. First, as with any search system, articles that share keywords but belong to different fields may appear as relevant. Second, journal targets and experts are calculated from the frequencies of journals and authors in the eTBLAST results; these suggestions do not account for the publication volume of each journal. Finally, journal impact factors, which may be indicators of expertise level, are not considered. These enhancements may improve performance and are being evaluated as potential upgrades to the system.
Conflict of interest statement. None declared.