Our June editorial discusses the relationship between web traffic and citations. Specifically, can one predict how well a particular paper will be cited years after publication, based solely on the number of downloads it receives immediately following its appearance online? Our preliminary analysis suggests that this relationship not only exists, but is surprisingly strong.
I’ll leave you to read the editorial for more of the background on why we examined this relationship, but I will repeat a few key things here. The main purpose of this post is to provide more of the details behind the data and analysis, and to initiate a good discussion.
Everyone has their own pet problem with impact factors, whether it be the calculation method, the non-reproducibility of the actual values, or the disagreement over what IFs really represent, just to name a few. Despite all of these concerns (and more), these numbers are typically used to rate the importance or prominence of a particular journal, and thus, by proxy, the importance of the individual papers published within. This is a seriously flawed use of association (see a previous Nature Neuroscience editorial discussing this concept), and it leads scientists to equate the total number of citations with scientific impact, a practice fraught with its own problems. Searching for an alternative measure of impact that is perhaps free of the “bias of authority” (citing a paper because it is from a famous lab) or the “lemming bias” (citing a paper just because everyone else seems to do so whenever broaching a particular subject) led us to explore readership.
The readership of a particular article should roughly reflect the outside interest in the topic and the perceived value of the experiments within. Readership can potentially be quantified by examining download statistics from the website where a manuscript is published. These statistics can be viewed in the same way as the NY Times bestseller list, in the sense that the data are indirect; the numbers don’t actually measure readership so much as they measure access and potential readership. In other words, we are assuming that everyone who downloads a paper (especially the PDF version) actually reads it. That is definitely a leap of faith, but we took this caveat in hand and pressed forward.
The papers in our initial dataset were published online between January 2005 and November 2005 (N = 215 papers). Only research articles and reviews were considered. Our download statistics are compliant with COUNTER, an initiative that provides libraries and publishers with more consistent and credible usage data. For the purposes of the editorial and this post, the actual numbers are transformed. Our citation data come from Scopus, although we could probably have used Google Scholar or Thomson products just as easily (several studies have found an equivalence in the citation listings between Thomson’s Web of Science and Google Scholar, and there is no reason to believe that Scopus would be any different [Belew, 2005; Pauly & Stergiou, 2005]). Both sets of data were accurate as of the end of March. For web traffic data, total downloads within a particular time frame were counted starting from the Advance Online Publication (AOP) date.
We initially noticed that immediate PDF downloads correlated better with eventual citation counts than did HTML downloads (R = 0.65 vs. 0.60 for PDF and HTML downloads, respectively). Therefore, we focused on PDF downloads for the remainder of the analysis. It is important to note that this measurement is independent of citation or web traffic differences between fields or between different types of papers: the lowest-cited, least-downloaded paper contributes as much to the weakness or robustness of the correlation as does the most highly cited, heavily downloaded one.
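For readers who would like to run the same comparison on their own usage data, here is a minimal Python sketch of the correlation step. The counts below are invented placeholders (not our real data), and SciPy’s `pearsonr` simply stands in for whatever statistics package you prefer; any transform applied to the actual numbers is omitted.

```python
# Minimal sketch: correlate early downloads with eventual citations.
# All counts below are hypothetical placeholders for illustration only.
from scipy.stats import pearsonr

pdf_downloads = [812, 450, 1290, 300, 975]     # PDF downloads, first 90 days post-AOP
html_downloads = [1500, 900, 2100, 650, 1800]  # HTML downloads, same window
citations = [34, 12, 58, 9, 41]                # eventual citation counts

r_pdf, _ = pearsonr(pdf_downloads, citations)
r_html, _ = pearsonr(html_downloads, citations)
print(f"R (PDF vs. citations):  {r_pdf:.2f}")
print(f"R (HTML vs. citations): {r_html:.2f}")
```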
As the download time frame was extended, the correlation progressively increased up until 180 days post-AOP; fast-forwarding to 1 year, however, the correlation dropped significantly (Fig. 1). This makes sense in retrospect, since web traffic typically declines with time (as a paper becomes “old news”), while citation counts increase with time. The divergence between these two measurements dramatically weakens the correlation. This peak correlation between downloads and citations at 6 months was also observed in a previous study that examined the relationship between web traffic and citations for papers deposited in the arXiv pre-print server (Brody et al., 2006).
Figure 1 The correlation between downloads and citation counts increases up until 6 months, and then dramatically decreases at 1 year. Correlation coefficients are graphed as a function of time.
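For those curious how a curve like Fig. 1 could be generated, here is a hedged sketch: for each candidate window length, sum each paper’s daily downloads from its AOP date up to that cutoff, then correlate the totals with eventual citations. The function name and its inputs are assumptions for illustration, not our actual pipeline.

```python
# Sketch: correlation between cumulative downloads and citations as a
# function of the download window. Inputs are assumed, not our real data.
import numpy as np
from scipy.stats import pearsonr

def correlation_by_window(daily_downloads, citations, windows=(30, 60, 90, 180, 365)):
    """daily_downloads: array of shape (n_papers, n_days), day 0 = AOP date.
    citations: array of shape (n_papers,). Returns a {window_days: R} dict."""
    daily_downloads = np.asarray(daily_downloads)
    citations = np.asarray(citations)
    results = {}
    for w in windows:
        totals = daily_downloads[:, :w].sum(axis=1)  # downloads in the first w days
        results[w], _ = pearsonr(totals, citations)
    return results
```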
We next decided to see how well web download data could predict eventual citations. For this analysis, we calculated a linear best-fit equation for the data graphed in the editorial. We then took all papers published in Nature Neuroscience within the first 3 months of 2006 (N = 55 papers), and used their 90-day PDF download numbers as the ‘X’ input to the equation. This yielded a set of predicted citation values, from which we derived a predicted best-fit line for the 2006 data. Comparing this line to the actual best-fit line for the data, we see that although the two lines differ, their slopes are nearly identical, suggesting that our predicted values are systematically offset towards higher citation counts (Fig. 2). This offset could arise because the citation data for papers published so recently are not yet mature, with actual citations generally lagging behind those predicted by the model.
Figure 2 PDF downloads vs. citations counts for 2006 papers. Predicted line derived from calculations using 2005 best-fit equation.
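Here is a sketch of this prediction step, under the assumption that the four arrays hold the 90-day PDF download and citation counts for each year’s papers; `np.polyfit` stands in for whatever fitting routine was actually used.

```python
# Sketch: fit a line to the 2005 data, predict 2006 citations from
# 2006 downloads, and compare the two best-fit lines. Inputs assumed.
import numpy as np

def fit_and_predict(downloads_2005, citations_2005, downloads_2006, citations_2006):
    # Linear best fit to 2005 data: citations ~ slope * downloads + intercept.
    slope_05, intercept_05 = np.polyfit(downloads_2005, citations_2005, 1)
    predicted_2006 = slope_05 * np.asarray(downloads_2006) + intercept_05

    # Best fit to the actual 2006 data, to compare slopes and offsets.
    slope_06, intercept_06 = np.polyfit(downloads_2006, citations_2006, 1)
    print(f"2005 fit: slope={slope_05:.3f}, intercept={intercept_05:.2f}")
    print(f"2006 fit: slope={slope_06:.3f}, intercept={intercept_06:.2f}")
    return predicted_2006
```

Nearly identical slopes with different intercepts would reproduce the systematic offset described above.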
Finally, we decided to test how this relationship would hold up in another discipline. Previous studies of downloads vs. citations revealed a similar positive correlation between downloads and citation counts for individual papers, both for physics and math preprints (Brody et al., 2006) and for a subset of the medical literature (Perneger, 2004). We extended our own study to papers published in Nature Genetics in 2005 (N = 168). Again, we found a strong correlation between immediate PDF downloads and eventual citation counts (R = 0.71) (Fig. 3). Thus, this relationship is likely to hold up across various disciplines, across journals with different impact factors, and for pre-prints as well as published articles. With studies suggesting that open-access articles receive more citations than those published behind firewalls (Eysenbach, 2006), it would be interesting to determine how open-access articles (with a presumed higher readership, or at least potential readership) fare in this type of analysis.
Figure 3 PDF downloads vs. citation counts for 2005 articles published in Nature Genetics.
We realize that this analysis is suggestive at best, potentially providing one piece of an alternative solution for deciphering the impact of an individual paper. In the current scientific climate, where tenure and grant funding decisions are influenced by flawed metrics like the impact factor, it is important to make good use of all available technology in an attempt to build a better system for measuring the scientific impact of any particular paper. This analysis is obviously preliminary and flawed in its own ways, but perhaps metrics such as paper downloads can find a place in a compilation of aggregated stats, painting a more accurate and informative picture of manuscript influence.
This analysis was conducted jointly with Hilary Spencer of Nature Precedings. We would like to thank Jamie Sampson for assistance in acquiring the download statistics.
UPDATE: Sorry it took so long, but here are the plots of the data (the NN data from the editorial and the NG data from above) without the log scales. This was by request.