While the biomedical literature contains millions of references in the Medline database, in many situations the topics of interest for a given user will be described in thousands or tens of thousands of abstracts, of which only a handful will be relevant for this user. Finding those relevant abstracts for a given biomedical domain using the main search engine available for Medline, i.e. keyword-based Boolean PubMed, requires a set of words related to the topic of interest, and choosing these words requires domain-specific knowledge. Moreover, even for an expert user, sorting through numerous unranked results can hinder the selection of relevant papers. Retrieval of interesting documents can benefit from text mining tools that do not require expert knowledge from the user and that are able to order the results by relevance.
The MedlineRanker webserver provides a fast and flexible way to rank the biomedical literature without expert knowledge. Querying is not limited by a complex query syntax, a controlled vocabulary, or any existing annotation of the literature. There is only one input required, a list of abstracts related to the topic of interest, to start the ranking of recent abstracts and the determination of discriminative words. This list of abstracts can be set directly by the user or automatically constructed using biomedical terms. Optionally, this list can be compared to a user-provided set of abstracts instead of the whole Medline database. This may produce a better ranking of closely related abstracts. Moreover, different sets can be ranked, including the most recent months or years of Medline and also user-provided abstracts. The latter can for instance be used to focus on a given database or a given gene, by providing a list of abstracts related to that database or gene. Our tool can process tens of thousands of abstracts in a few seconds, and approximately one million per minute when ranking the most recent years of Medline.
MedlineRanker will produce more accurate results if the user provides a training set with enough abstracts to define the topic of interest. In our experience and as shown above, 100–1000 abstracts are appropriate for most topics, but providing more abstracts is likely to improve the method's precision. Of course, the more homogeneous the abstracts related to a topic are, the better the ranking will be. One can get an idea of the predictive performance by observing the statistical output from the tool (C).
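To make the idea of ranking by discriminative words concrete, the following is a minimal sketch, not the webserver's actual implementation. It assumes a simple log-odds weight per word, computed from how often a word appears in the training abstracts versus a background set, and it uses plain lowercase word tokens as a stand-in for noun extraction. The function names (`tokenize`, `word_weights`, `rank`) and the `pseudo` smoothing parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase word tokens in a text.

    Assumption: this replaces real noun extraction, which would
    need a part-of-speech tagger.
    """
    return set(re.findall(r"[a-z]+", text.lower()))


def word_weights(training, background, pseudo=1.0):
    """Weight each word by the log-odds of its document frequency
    in the training set versus the background set.

    `pseudo` is a hypothetical smoothing count that avoids log(0).
    """
    train_df = Counter(w for doc in training for w in tokenize(doc))
    back_df = Counter(w for doc in background for w in tokenize(doc))
    weights = {}
    for w in set(train_df) | set(back_df):
        p_train = (train_df[w] + pseudo) / (len(training) + 2 * pseudo)
        p_back = (back_df[w] + pseudo) / (len(background) + 2 * pseudo)
        weights[w] = math.log(p_train / p_back)
    return weights


def rank(candidates, weights):
    """Score each candidate as the sum of its word weights and
    return (score, text) pairs, best first. Unknown words score 0."""
    scored = [
        (sum(weights.get(w, 0.0) for w in tokenize(doc)), doc)
        for doc in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Under these assumptions, words with the largest positive weights play the role of discriminative words, and a larger, more homogeneous training set gives more stable frequency estimates, which is consistent with the advice above.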
The MedlineRanker can be compared to two other Medline data mining tools that also use sets of Medline entries as input: PubFinder (17), which has not been updated for several years, and MScanner (10), which uses a different method. The latter is different in its way to select discriminative features: it uses mainly abstract annotations (MeSH terms and journal identifiers), whereas MedlineRanker uses only nouns extracted from abstract texts. As a result, MedlineRanker can be applied to all publications with an English abstract, including those with incomplete or missing annotations, while this is not possible with MScanner. This was illustrated with a benchmark ranking Arabidopsis-related abstracts according to host–pathogen interactions ( and supporting ). Manual validations showed 99 true positives within the best hundred abstracts, and only 17 were properly annotated. Very few (a total of 108) abstracts published for the plant model Arabidopsis thaliana received the tag ‘Host–pathogen Interactions’ in the MeSH Database. The results show that MedlineRanker can be also useful in the attribution of new MeSH terms.
Comparing the speed of different methods is complicated because there is often a trade-off between speed, capabilities and performance. MScanner, designed for maximum speed, sacrifices flexibility by forcing all of Medline to be ranked. Ranking annotated abstracts from the whole Medline takes approximately one to three minutes using MScanner. MedlineRanker, designed for flexibility, is not faster and processes approximately one million abstracts within a minute. Despite these differences, the two methods may be considered complementary since both behave very well but differently for various topics (). Nevertheless, MedlineRanker seems to perform better when few abstracts are used to define the topic.
The MedlineRanker webserver is more general than other comparable resources because it allows ranking user-defined sets of abstracts, and it also allows the user to define a particular set as the reference. For example, one can choose to rank only the abstracts associated with a given database that provides Medline references, such as some PPI databases or other molecular databases. This was illustrated above with a benchmark ranking all references linked from three PPI databases according to a complex topic: phosphorylation-dependent molecular processes. Manual validation of the best 100 abstracts selected by MedlineRanker shows the relevance of our method, which can reach a positive predictive value of 0.90. Yet, the method used here to rank abstracts is not dedicated to detecting relationships between concepts. The benefit is speed, as it can retrieve many candidates in a few seconds. Using, in a second step, co-occurrence analysis of different concepts at the sentence level and semantic analysis, for which specialized tools are available, may help to focus on true positives in such complex situations.
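As an illustration of the second step suggested here, the sketch below shows one simple way to test whether two concepts co-occur within the same sentence, together with the positive predictive value calculation used in the manual validation. It is an assumption-laden example: the sentence splitter is a naive regular expression, the concept term sets are supplied by the caller, and the function names are hypothetical. It does not correspond to any specific specialized tool.

```python
import re


def sentences(text):
    """Naively split text into sentences at ., ! or ? followed by
    whitespace. Assumption: good enough for illustration only."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccur_in_sentence(text, terms_a, terms_b):
    """Return True if at least one sentence contains a term from
    each of the two concept term sets (lowercase single words)."""
    for sentence in sentences(text):
        words = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
        if words & terms_a and words & terms_b:
            return True
    return False


def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)
```

For example, a positive predictive value of 0.90 over 100 manually validated abstracts corresponds to `positive_predictive_value(90, 10)`. A filter such as `cooccur_in_sentence` could then be applied to the top-ranked candidates to keep only abstracts in which the two concepts are mentioned together.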
In conclusion, the MedlineRanker webserver provides a fast and flexible tool to rank the biomedical literature without expert knowledge. It is not limited to any topic and can be useful for all scientists interested in ranking or retrieving relevant abstracts from the Medline database, including specific subsets like abstracts linked from particular databases.