November 1, 2018 by Gary Price

New Research Article: “Testing Google Scholar Bibliographic Data: Estimating Error Rates for Google Scholar Citation Parsing”

The following article appears in the November 2018 issue of First Monday and was posted online earlier today.
Title
Testing Google Scholar Bibliographic Data: Estimating Error Rates for Google Scholar Citation Parsing
Authors
David Zeitlyn
Institute of Social and Cultural Anthropology, School of Anthropology and Museum Ethnography
University of Oxford


Megan Beardmore-Herd 
Institute of Social and Cultural Anthropology, School of Anthropology and Museum Ethnography
University of Oxford
Source
First Monday 
Vol. 23 No. 11
November 2018
DOI: 10.5210/fm.v23i11.8658
Abstract

We present some systematic tests of the quality of bibliographic data exports available from Google Scholar. While data quality is good for journal articles and conference proceedings, books and edited collections are often wrongly described or have incomplete data. We identify a particular problem with material from online repositories.

From the Introduction

It is well known that Google prefers algorithmic or automated approaches to conglomerating metadata. This relies heavily on sources tagging and formatting their data in an appropriate “Google friendly” manner. Thus, the Google Scholar team is perhaps limited in the functionality that they can deliver due to the inhomogeneity of the data that they are handling to deliver the GS Web site. However, the issues that we raise below, we believe to be within Google’s power to improve.
The problems with the data fall into three main classes: i) the completeness of data harvested from sources; ii) the representation of data harvested in Google’s own data system; and iii) the inhomogeneity and poor quality of data standards used in displaying and coding bibliographic information on Web sites (for example, the Dublin Core standard used as a basis for data holdings in many institutional digital repositories). Indeed, there are institutional repository records where full bibliographic information is included in the repository entry but is not harvested by Google. Moreover, our study strongly suggests that some repository software seems not to put all the information into html metatags which GS harvest, so the reference generated is likely to be incomplete.
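To make the metatag point concrete, here is a minimal Python sketch (not from the article) that lists which bibliographic meta tags a repository page fails to expose. The tag names in EXPECTED_TAGS are an assumption: they follow the Highwire-style "citation_*" naming commonly used for scholarly indexing, and the sample page is invented for illustration. The article itself does not specify which tags GS reads.

```python
from html.parser import HTMLParser

# Assumption: an illustrative checklist of Highwire-style tag names.
# The article does not list the tags that GS actually harvests.
EXPECTED_TAGS = (
    "citation_title",
    "citation_author",
    "citation_publication_date",
    "citation_isbn",
)


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content:
            # Tags such as citation_author may legitimately repeat.
            self.found.setdefault(name.lower(), []).append(content)


def missing_citation_tags(html_text, expected=EXPECTED_TAGS):
    """Return the expected tag names that the page does not provide."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return [name for name in expected if name not in collector.found]


# Invented sample: the date is present only as Dublin Core, so a harvester
# that looks solely for citation_* tags would not see it.
SAMPLE_PAGE = """
<html><head>
<meta name="citation_title" content="An Example Book Chapter">
<meta name="citation_author" content="Doe, Jane">
<meta name="DC.date" content="2017">
</head><body></body></html>
"""

print(missing_citation_tags(SAMPLE_PAGE))
```

For the sample page this reports the publication date and ISBN as missing, which is the kind of gap that would leave a generated reference incomplete.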
[Clip]
It is important for the reader to understand a key point around the reproducibility of the results listed below. It is not in the remit of the current project to source a full copy of the Google Scholar data or to obtain a copy of Google’s code base in order to allow the results below to be reproduced. The Google database and codebase are dynamically evolving, hence, our study cannot be replicated in the strictest sense of the term: the exact same searches can be repeated but they will not be searching the same dataset so the results may differ. This is an inevitable feature of research online and does not invalidate our results. For the record the data was collected over a period of months from October 2017 to February 2018 and the full dataset (which includes the date and time of the searches) is being made available for other researchers. We have saved both the lists of results received and the ris files that were downloaded and analysed.
These are available as an open data appendix to this article on Figshare; DOI: https://doi.org/10.6084/m9.figshare.5984845.

From the Discussion and Conclusions

The recent moves to promote repositories driven by the laudable aims of open access publishing have introduced further noise into the system since there appears to be considerable inhomogeneity in the implementation of data standards, or possibly in clarity around how these standards should be applied. This has led to a mismatch between repository software and the harvesting protocols employed by GS. Our data suggest that the accuracy of GS ris files for books and repository records is unacceptably low (in the context of meeting academic needs) but seemingly quite easily improvable. At the very least repository software needs to report more and better quality information in html metatags and GS need to be better at providing the full set of data in the downloads they provide.
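As a rough illustration of how the completeness of a downloaded ris file could be checked, the following Python sketch parses RIS records and reports which fields are absent. It is not the authors' method. The per-type checklists in REQUIRED_FIELDS are hypothetical, and the sample record is invented rather than taken from the study's dataset.

```python
import re

# An RIS line is a two-character tag, two spaces, a hyphen, then the value,
# for example "TY  - BOOK". "ER" closes a record.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

# Hypothetical checklists: each tuple lists alternative tags for one field
# (RIS allows both TI and T1 for titles, AU and A1 for authors, and so on).
REQUIRED_FIELDS = {
    "BOOK": [("AU", "A1"), ("TI", "T1"), ("PY", "Y1"), ("PB",)],
    "JOUR": [("AU", "A1"), ("TI", "T1"), ("PY", "Y1"), ("JO", "JF", "T2")],
}


def parse_ris(text):
    """Split RIS text into records, each a dict mapping tag -> list of values."""
    records, current = [], {}
    for raw_line in text.splitlines():
        match = RIS_LINE.match(raw_line.strip())
        if not match:
            continue  # skip blank or malformed lines
        tag, value = match.groups()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    if current:  # tolerate a final record with no ER line
        records.append(current)
    return records


def missing_fields(record):
    """Return the first tag of every required field the record lacks."""
    ref_type = record.get("TY", [""])[0]
    return [
        alternatives[0]
        for alternatives in REQUIRED_FIELDS.get(ref_type, [])
        if not any(tag in record for tag in alternatives)
    ]


# Invented sample record: a book with no year and no publisher.
SAMPLE_RIS = """TY  - BOOK
T1  - An Example Monograph
A1  - Doe, Jane
ER  - 
"""

for record in parse_ris(SAMPLE_RIS):
    print(record["TY"][0], "is missing:", missing_fields(record))
```

Counting such gaps per reference type is one simple way to estimate an error rate across a batch of exports, provided the checklist matches what the reader's own citation style actually requires.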

Direct to Full Text Article (approx. 2300 words)

Filed under: Conference Presentations, Data Files, Journal Articles, News, Open Access, Publishing

About Gary Price

Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne St. University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.

© 2023 Library Journal. All rights reserved.

