May 21, 2022

New Research Article: “Testing Google Scholar Bibliographic Data: Estimating Error Rates for Google Scholar Citation Parsing”

The following article appears in the November 2018 issue of First Monday and was posted online earlier today.


Testing Google Scholar Bibliographic Data: Estimating Error Rates for Google Scholar Citation Parsing


David Zeitlyn
Institute of Social and Cultural Anthropology, School of Anthropology and Museum Ethnography
University of Oxford

Megan Beardmore-Herd 
Institute of Social and Cultural Anthropology, School of Anthropology and Museum Ethnography
University of Oxford


First Monday 
Vol. 23 No. 11
November 2018
DOI: 10.5210/fm.v23i11.8658


We present some systematic tests of the quality of bibliographic data exports available from Google Scholar. While data quality is good for journal articles and conference proceedings, books and edited collections are often wrongly described or have incomplete data. We identify a particular problem with material from online repositories.

From the Introduction

It is well known that Google prefers algorithmic or automated approaches to aggregating metadata. This relies heavily on sources tagging and formatting their data in an appropriately “Google-friendly” manner. The Google Scholar team is therefore perhaps limited in the functionality they can deliver by the inhomogeneity of the data they handle in producing the GS Web site. However, we believe the issues we raise below are within Google’s power to improve.

The problems with the data fall into three main classes: i) the completeness of data harvested from sources; ii) the representation of harvested data in Google’s own data system; and iii) the inhomogeneity and poor quality of the data standards used to display and encode bibliographic information on Web sites (for example, the Dublin Core standard used as a basis for data holdings in many institutional digital repositories). This is especially visible in institutional repository records, where full bibliographic information is typically included in a repository entry but is not harvested by Google. Moreover, our study strongly suggests that some repository software does not put all the information into the HTML meta tags that GS harvests, so the references generated are likely to be incomplete.
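To illustrate the kind of tagging at issue (this sketch is ours, not drawn from the article): Google Scholar’s inclusion guidelines ask publishers and repositories to expose bibliographic fields in Highwire Press-style `<meta>` tags such as `citation_title` and `citation_author`; a page missing these tags can only be parsed heuristically. A minimal sketch in Python, with invented placeholder values:

```python
# Sketch: Highwire Press-style meta tags of the kind Google Scholar's
# inclusion guidelines ask repositories to expose. The tag names are the
# documented ones; the values below are invented placeholders.
record = {
    "citation_title": "An Example Article",
    "citation_author": "A. N. Author",
    "citation_publication_date": "2018/11/01",
    "citation_journal_title": "First Monday",
    "citation_pdf_url": "https://example.org/article.pdf",
}

def to_meta_tags(record):
    """Render a bibliographic record as HTML <meta> tags."""
    return "\n".join(
        '<meta name="{0}" content="{1}">'.format(name, value)
        for name, value in record.items()
    )

print(to_meta_tags(record))
```

If a repository emits only a subset of these tags, the downstream reference GS generates from the page can be no more complete than the tags themselves.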


It is important for the reader to understand a key point about the reproducibility of the results listed below. It was not in the remit of the current project to source a full copy of the Google Scholar data or to obtain a copy of Google’s code base so that the results below could be reproduced. The Google database and codebase are dynamically evolving; hence our study cannot be replicated in the strictest sense of the term: the exact same searches can be repeated, but they will not be searching the same dataset, so the results may differ. This is an inevitable feature of research online and does not invalidate our results. For the record, the data were collected over a period of months from October 2017 to February 2018, and the full dataset (which includes the date and time of the searches) is being made available for other researchers. We have saved both the lists of results received and the RIS files that were downloaded and analysed.

These are available as an open data appendix to this article on Figshare; DOI:

From the Discussion and Conclusions

The recent moves to promote repositories, driven by the laudable aims of open access publishing, have introduced further noise into the system, since there appears to be considerable inhomogeneity in the implementation of data standards, or possibly a lack of clarity about how those standards should be applied. This has led to a mismatch between repository software and the harvesting protocols employed by GS. Our data suggest that the accuracy of GS RIS files for books and repository records is unacceptably low (in the context of meeting academic needs) but seemingly quite easily improvable. At the very least, repository software needs to report more, and better quality, information in HTML meta tags, and GS needs to be better at providing the full set of data in the downloads it provides.

Direct to Full Text Article (approx. 2300 words)

About Gary Price

Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors of ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at, and is currently a contributing editor at Search Engine Land.