January 17, 2022

New from JSTOR: Early Journal Content Metadata+OCR Data Bundle Now Available

Via a JSTOR Tweet

From JSTOR’s Data for Research website:

We are happy to also make a data bundle for the Early Journal Content freely available to those who would like to conduct data mining or other research across the content.

The data bundle for EJC includes full-text OCR and article- and title-level metadata. The Read Me file explains the data in more detail. The currently available data bundle includes all the EJC as of September 7, 2011.

Please note that use of the Early Journal Content bundle is subject to the Early Journal Content Specific Terms and Conditions of Use.

To access the data bundle, please create an account using the very brief registration form, or log in if you already have a Data for Research account. We plan to update the bundle on a semi-regular basis and to alert registrants when the bundle has been updated.

The format of the data bundle is a .tar.gz archive containing a readme file explaining the format of the data files, and an XML file for each article in the Early Journal Content bundle.
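
For readers who want to work with the archive directly, here is a minimal Python sketch of one way to walk a .tar.gz file and parse each XML file inside it without unpacking everything to disk first. The function name iter_article_xml and the idea of passing the bundle path on the command line are illustrative assumptions, not anything supplied by JSTOR. The sketch also makes no assumptions about the XML schema (the readme in the bundle is the authority on that); it only reports the root element name of each file.

```python
import sys
import tarfile
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterator, Tuple


def iter_article_xml(bundle_path: str) -> Iterator[Tuple[str, ET.Element]]:
    """Yield (member name, parsed XML root) for each .xml file in a .tar.gz.

    Hypothetical helper: the layout inside the real bundle is described in
    its readme, so this only filters on the ".xml" file extension.
    """
    with tarfile.open(bundle_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, the readme, and anything that is not XML.
            if not member.isfile() or not member.name.lower().endswith(".xml"):
                continue
            fileobj = archive.extractfile(member)
            if fileobj is None:
                continue
            with fileobj:
                try:
                    root = ET.parse(fileobj).getroot()
                except ET.ParseError:
                    # Files that are not well-formed XML are skipped here;
                    # a real project would probably want to log them.
                    continue
            yield member.name, root


if __name__ == "__main__":
    # Usage: python read_bundle.py <path to the downloaded .tar.gz>
    tag_counts = Counter(root.tag for _, root in iter_article_xml(sys.argv[1]))
    for tag, count in tag_counts.most_common():
        print(f"{tag}: {count}")
```

Reading members one at a time keeps memory use modest, which matters for an archive that expands to several gigabytes. Files that fail to parse are silently skipped in this sketch, so the counts it prints may be lower than the number of files in the archive.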

Once logged in, you can download the Early Journal Content bundle here.

The size of the bundle is approx. 2.3 GB compressed, and 7.2 GB inflated.

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.