January 23, 2022

Text Mining: HathiTrust Research Center Expands Services to Scholars

From the HathiTrust Research Center:

The HathiTrust Research Center (HTRC), a cooperative service of Indiana University, University of Illinois, and HathiTrust, has expanded its services to support computational research on the entire collection of one of the world’s largest digital libraries, held by HathiTrust. HathiTrust’s collections include over 14 million digitized volumes, including more than 7 million books, more than 725,000 US federal government documents, and more than 350,000 serial publications. HathiTrust’s collections are drawn from some of the largest research libraries in North America, including Indiana University and the University of Illinois.

Previously, the HathiTrust Research Center supported analysis of only the public domain subset of the HathiTrust collection. HTRC is now the only place where scholars can perform text mining on the entire HathiTrust collection. In other words, researchers can now explore the entire collection, run an algorithm against all 14 million volumes, and make new connections and discoveries in the process.


At first, researchers will be able to access the HTRC collection through its Advanced Collaborative Services grants. This peer-reviewed grant process gives awardees dedicated HTRC staff time.

HTRC expects to make the full collection available through its secure HTRC data capsules in spring 2017. A features data set, derived from the full collection at both volume level and page level, will be released in fall 2016. “The upcoming release of the extracted features data derived from the full collection will enable researchers to have hands-on access to HT materials, allowing scholars to refine their research questions for the corpus in the comfort of their own labs. Another game-changing breakthrough for HTRC,” said J. Stephen Downie, the Illinois co-director of HTRC and a professor at the Graduate School of Library and Information Science (GSLIS), University of Illinois.

On a Related Note…

Recorded Webinar (Audio + Slides): An Introduction to Text Mining Government Publications

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C., metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.