A discussion about citations in academic papers over lunch today reminded me that it's time to write about an older paper dealing with the citation process. In 2002, Simkin and Roychowdhury published a paper entitled Read Before You Cite, in which they claimed that only about 20 % of citers read the paper they were citing. They studied the distribution of misprints in bibliographic references and assumed a correlation between misprints and whether the citing author had actually read the paper. At first glance, this assumption seems quite logical, as an alert reader will notice errors in the bibliographic record. Simkin and Roychowdhury present a nice analytical evaluation in which they also show that the misprint distribution follows a Zipf law. However, the correctness of the result depends entirely on the correctness of the basic assumptions. And this is where I believe the problem of this paper lies: at least in computer science, the citation process may have some further properties that the paper neglects.
My citation process is decoupled from my reading process. When I discover an interesting paper, I usually print it out, as this is more comfortable for taking notes and allows “offline” reading on the suburban train or bus. I'm too lazy to take my (heavy) laptop with me all the time, but I usually have some papers in my bag. After reading the paper, I might file it away. Some time may pass before I pick up the paper again to cite it when working on a publication or writing a mail. By then, I might have forgotten some of the details needed for citing it (maybe the volume of the journal). Usually, I write a short note at the top of the first page to remind me of the most important bibliographic data, but sometimes I just forget. When citing the paper, I mostly use public databases (such as the one provided by the ACM) or visit the author's web page to obtain everything I need in BibTeX format, ready to cut and paste into my bibliographic database. Nowadays, this technique is very convenient and fast. But what if the record I just copied was erroneous? (Sometimes even the bibliographic records provided on the author's own page are erroneous!) Well, then I might spread another misprint of the kind measured by Simkin and Roychowdhury, even though I did read the paper.
All I want to say is that there is not necessarily a correlation, let alone a causal connection, between a misprinted bibliographic record and whether an author actually read the paper. Moreover, a colleague came up with a metric that may be more reliable: simply compute the number of papers an author has to read per day (this works only for authors writing tons of papers). However, as such an author will likely be a full professor or the head of a department who puts his name on all the works of his Ph.D. students, the most interesting question would be: did they read what they wrote?
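Just for fun, here is a minimal sketch of how my colleague's metric could be computed. Everything in it is my own assumption rather than something he specified: the function name `papers_to_read_per_day`, the idea of multiplying papers written per year by references per paper, the default of 250 reading days per year, and the example numbers, which are purely hypothetical.

```python
def papers_to_read_per_day(papers_per_year: int,
                           refs_per_paper: float,
                           reading_days_per_year: int = 250) -> float:
    """Rough reading load if every cited paper were actually read.

    Assumption: references are not shared between the author's papers,
    so this is an upper bound rather than an exact figure.
    """
    if reading_days_per_year <= 0:
        raise ValueError("reading_days_per_year must be positive")
    return papers_per_year * refs_per_paper / reading_days_per_year


# Hypothetical prolific author: 50 papers per year, 30 references each.
load = papers_to_read_per_day(50, 30)
print(f"{load:.1f} papers per day")
```

Since real reference lists overlap heavily from one paper to the next, the true load would be lower, so the number is only good for spotting implausible extremes.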