We found that the Anderson and Schooler model predicts biomedical document accesses. Specifically, we showed that document accesses followed a power-law distribution, desirability as a function of FOA followed a power-law distribution, and desirability as a function of ROA and FOA followed a power-law distribution. Finally, we evaluated two versions of the Anderson and Schooler model for predicting document accesses. The first model calculated desirability based on ROA and FOA, and attained a 0.668 correlation with the test data. The second model calculated desirability based only on FOA and had a correlation of 0.932 with the test data.
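To make the two quantities in this summary concrete, the following is a minimal sketch of fitting a power law and scoring predictions by correlation. It assumes an ordinary least-squares fit on log-log axes and a Pearson correlation. The text does not specify the fitting procedure or the correlation coefficient used, so both choices are illustrative assumptions, and the function names `fit_power_law` and `pearson` are hypothetical. Log-log least squares is simple but can be biased relative to maximum-likelihood fitting.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log-log axes; return (a, b).

    Assumes every x and y is strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                        # slope in log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = log(a)
    return a, b


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Synthetic illustration only: these numbers are not the study's data.
    foa = [1, 2, 4, 8, 16]
    observed = [0.50, 0.27, 0.12, 0.07, 0.03]
    a, b = fit_power_law(foa, observed)
    predicted = [a * f ** b for f in foa]
    print(a, b, pearson(predicted, observed))
```

In this sketch the fitted curve plays the role of the desirability model, and the correlation between its predictions and held-out observations plays the role of the reported test correlation.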
This study is the first to evaluate the Anderson and Schooler model for predicting biomedical document access, the first to show that the model generalizes to bibliographic databases, and the first to investigate modeling desirability in the biomedical domain. The most similar study is that of Recker and Pitkow,23 which validated the assumptions of the Anderson and Schooler model for WWW retrieval. The results of Recker and Pitkow23 and the results presented here are mutually reinforcing and show that it is possible to model the desirability of documents in a variety of domains.
A weakness of our study is that the results could be specific to HAM-TMC users. Approximately one-third of PubMed users are from the general public.25 In contrast, the general public accounts for approximately 20% of HAM-TMC transactions (HAM-TMC library, personal communication). In addition, the logs captured sessions of authenticated users, and some users may access PubMed without logging into the HAM-TMC library. Further, the fact that a document was accessed does not mean that the document met the user's information need or was necessarily useful to the user.
An interesting finding is that the FOA desirability model outperformed the ROA and FOA model. We hypothesize that the poorer performance of the ROA and FOA model was caused by the very small number of documents in some groups, particularly the highly accessed groups. Because the number of accesses follows a power-law distribution, there are many documents with few accesses and a small number of documents with many accesses. For example, 448 061 documents were accessed once, but only six documents were accessed 100 times.
To test this hypothesis, we created document groups in which each group contained at least 100 documents. With at least 100 documents in each group, the FOA and ROA desirability model had a correlation of 0.973, and the FOA desirability model had a correlation of 0.984. The improvement in the FOA and ROA model indicates that its lower performance was caused by the small number of documents that were accessed many times. The FOA desirability model is insensitive to small sample sizes because the same desirability is assigned to any document with a given FOA. In contrast, the desirability calculation based on FOA and ROA is sensitive to small sample sizes because desirability is calculated from individual access patterns and then averaged over the group.
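The regrouping step can be sketched as follows. This is one plausible way to guarantee a minimum group size: adjacent access-frequency values are merged until each group holds at least `min_size` documents. The text does not describe the exact procedure, so the merging rule, the function name `group_by_min_size`, and the input format are assumptions made for illustration.

```python
from collections import Counter


def group_by_min_size(access_counts, min_size=100):
    """Merge adjacent access-frequency values into groups of >= min_size documents.

    access_counts: iterable with one access count per document (hypothetical input).
    Returns a list of (lowest_freq, highest_freq, n_documents) tuples.
    """
    docs_per_freq = Counter(access_counts)
    groups = []
    current = None  # [low, high, n_docs] for the group being filled
    for freq in sorted(docs_per_freq):
        if current is None:
            current = [freq, freq, 0]
        current[1] = freq
        current[2] += docs_per_freq[freq]
        if current[2] >= min_size:
            groups.append(tuple(current))
            current = None
    # A sparse tail that never reaches min_size is folded into the last full group.
    if current is not None:
        if groups:
            low, _, n_docs = groups[-1]
            groups[-1] = (low, current[1], n_docs + current[2])
        else:
            groups.append(tuple(current))
    return groups


if __name__ == "__main__":
    # Synthetic counts with a heavy tail; not the study's data.
    counts = [1] * 250 + [2] * 120 + [3] * 60 + [4] * 50 + [100] * 6
    for low, high, n_docs in group_by_min_size(counts, min_size=100):
        print(low, high, n_docs)
```

Under this rule the rarely populated high-frequency values are pooled rather than evaluated as tiny groups, which is the property the comparison above depends on.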
We found that desirability can be modeled, but our results do not explain why documents are desirable. Many possible correlates were not investigated in this study, including publication type (review article, meta-analysis, etc), publication date, journal impact factor (JIF), citation count, and authors. Further, the actual document content (ie, full text) may contain useful information not reflected in meta-data or in information about the article such as frequency of past access. The observed power-law distribution may reflect a correlation between document accesses and citation counts, given that citation counts have also been shown to follow a power-law distribution.26 In addition, previous studies have shown that full-text downloads are correlated with citation count10 and can be used as proxies, effectively overcoming some of the weaknesses of citation data such as citation lag.10
In future work, we will investigate the type of information need best satisfied by desirability, given that different metrics are optimal for different information needs. Given the known correlation between citation counts and document accesses10 and the utility of citation data for finding important documents,28 we hypothesize that desirability can be used to find important documents while avoiding some of the aforementioned weaknesses of citation data. Finally, replicating this study using several months of PubMed query logs from the United States National Library of Medicine would eliminate bias created by constraining the analysis to a research medical center.