January 18, 2022

Research Article: “GORC: A Large Contextual Citation Graph of Academic Papers” (Preprint)

The preprint linked below was recently shared on arXiv.


GORC: A Large Contextual Citation Graph of Academic Papers


Kyle Lo
Lucy Lu Wang
Mark Neumann
Rodney Kinney
Dan S. Weld

Affiliation of All Authors: Allen Institute for Artificial Intelligence


via arXiv


We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers, across broad domains of science. Each paper is represented with rich paper metadata (title, authors, abstract, etc.), and where available: cleaned full text, section headers, figure and table captions, and parsed bibliography entries. In-line citation mentions in full text are linked to their corresponding bibliography entries, which are in turn linked to in-corpus cited papers, forming the edges of a contextual citation graph. To our knowledge, this is the largest publicly available contextual citation graph; the full text alone is the largest parsed academic text corpus publicly available. We demonstrate the ability to identify similar papers using these citation contexts and propose several applications for language modeling and citation-related tasks.
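The linking described in the abstract (in-line citation mention, to bibliography entry, to in-corpus cited paper) can be pictured with a small sketch. This is only an illustration under our own assumptions, not the dataset's actual schema or the authors' code: the class names, field names, and the `citation_edges` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class BibEntry:
    """A parsed bibliography entry (hypothetical fields)."""
    bib_id: str                     # local id within the citing paper
    title: str
    linked_paper_id: Optional[str]  # in-corpus paper id, or None if unresolved


@dataclass
class CitationMention:
    """An in-line citation mention, as character offsets into the full text."""
    start: int
    end: int
    bib_id: str


@dataclass
class Paper:
    paper_id: str
    title: str
    full_text: str = ""
    bib_entries: Dict[str, BibEntry] = field(default_factory=dict)
    mentions: List[CitationMention] = field(default_factory=list)


def citation_edges(paper: Paper, window: int = 40) -> List[Tuple[str, str, str]]:
    """Return (citing_id, cited_id, context) for every resolvable mention."""
    edges = []
    for m in paper.mentions:
        entry = paper.bib_entries.get(m.bib_id)
        # Skip mentions whose bibliography entry is missing or not in-corpus.
        if entry is None or entry.linked_paper_id is None:
            continue
        lo = max(0, m.start - window)
        hi = min(len(paper.full_text), m.end + window)
        edges.append((paper.paper_id, entry.linked_paper_id, paper.full_text[lo:hi]))
    return edges


# Toy example: two mentions, only one of which resolves to an in-corpus paper.
text = "Prior work [1] parsed PDFs, and others [2] did not."
paper = Paper(
    paper_id="P1",
    title="Citing paper",
    full_text=text,
    bib_entries={
        "b1": BibEntry("b1", "A cited paper", linked_paper_id="P2"),
        "b2": BibEntry("b2", "An unresolved reference", linked_paper_id=None),
    },
    mentions=[
        CitationMention(text.index("[1]"), text.index("[1]") + 3, "b1"),
        CitationMention(text.index("[2]"), text.index("[2]") + 3, "b2"),
    ],
)
edges = citation_edges(paper)
```

Each edge carries the text surrounding the mention, which is what makes the graph "contextual" rather than a plain list of paper-to-paper links. The fixed character window is a simplification chosen for this sketch.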

Direct to Full Text Article
12 pages; PDF.

Use Semantic Scholar:

Report: “AI2’s Semantic Scholar Search Engine Now Takes in the Full Sweep of Scientific Papers”
October 23, 2019

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.