October 6, 2015

Recently Published Article: “The Number of Scholarly Documents on the Public Web”

Below is a recently published article co-authored by someone we’ve admired for well over a decade, Dr. Lee Giles at Penn State University.

Dr. Giles is one of the developers of CiteSeerX, a wonderful specialty open-web database/search engine that focuses on info tech/computer science scholarly literature. The original CiteSeer predates Google Scholar by several years.

By the way, the “Seer” code is open source, and a number of other “seers” have also been released at various points over the years.

Now, to the research article.


The Number of Scholarly Documents on the Public Web


Madian Khabsa
Penn State University

C. Lee Giles
Penn State University


May 9, 2014


The number of scholarly documents available on the web is estimated using capture/recapture methods by studying the coverage of two major academic search engines: Google Scholar and Microsoft Academic Search.

Our estimates show that at least 114 million English-language scholarly documents are accessible on the web, of which Google Scholar has nearly 100 million.

Of these, we estimate that at least 27 million (24%) are freely available since they do not require a subscription or payment of any kind.

In addition, at a finer scale, we also estimate the number of scholarly documents on the web for fifteen fields: Agricultural Science, Arts and Humanities, Biology, Chemistry, Computer Science, Economics and Business, Engineering, Environmental Sciences, Geosciences, Material Science, Mathematics, Medicine, Physics, Social Sciences, and Multidisciplinary, as defined by Microsoft Academic Search. In addition, we show that among these fields the percentage of documents defined as freely available varies significantly, i.e., from 12 to 50%.
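For readers unfamiliar with capture/recapture, here is a minimal sketch of the textbook two-sample (Lincoln-Petersen) form of the idea: if two search engines each index part of the same pool of documents, the size of their overlap hints at the size of the whole pool. This is only an illustration and not necessarily the exact procedure used in the paper. The function name and the engine counts are hypothetical, chosen to keep the arithmetic easy to follow. The last two lines simply check the 24% figure from the abstract (27 million out of 114 million).

```python
def lincoln_petersen(n_first: int, n_second: int, n_both: int) -> float:
    """Textbook two-sample estimate of a population's total size.

    n_first  -- items found by the first source
    n_second -- items found by the second source
    n_both   -- items found by both sources
    """
    if n_both <= 0:
        raise ValueError("the two samples must overlap to give an estimate")
    return n_first * n_second / n_both


# Hypothetical counts (NOT from the paper), for illustration only.
found_by_engine_a = 800
found_by_engine_b = 500
found_by_both = 400

estimated_total = lincoln_petersen(found_by_engine_a, found_by_engine_b, found_by_both)
print(estimated_total)  # 1000.0

# Sanity check on the abstract's figures: 27 million of 114 million.
free_percent = round(27 / 114 * 100)
print(free_percent)  # 24
```

The intuition: engine B found half of what engine A found (400 of 800), so B probably covers about half of everything, which puts the total near 500 / 0.5 = 1,000.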

Direct to Full Text Article ||| PDF Version

Note: We are working to determine the current status of Microsoft Academic Search (MAS). The search tool is still available online, but we don’t know whether it’s still being developed.

Longtime readers of infoDOCKET know that we are/were big admirers (and users) of this search tool. Regardless of what we learn, for older material (let’s say pre-2013) MAS remains a potentially valuable research tool you should know about.

Our friend and librarian colleague, Lee Dirks, who was central to the development of MAS, was tragically killed along with his wife, Judy, in August 2012.

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.
