May 19, 2022

University of Illinois Professor Uses Big Data to Research History of Gender in Fiction, Project Mined Data From 104,000 Books Found in HathiTrust Digital Library

From the U. of Illinois News Service:

The number of women writing works of fiction dropped dramatically from the middle of the 19th century to the middle of the 20th century, and the prominence of female characters in works of fiction declined as well.

At the same time, however, the gender differences between male and female characters became weaker. Ted Underwood, a University of Illinois professor of information sciences and of English, came to those seemingly conflicting findings when he used data-mining tools to look at 104,000 books written over a period of more than 200 years.

Underwood and his colleagues, David Bamman of the University of California, Berkeley, and U. of I. graduate student Sabrina Lee, explored the significance of gender in fiction by using an algorithm to look at books in the HathiTrust Digital Library.

Their findings are published in the Journal of Cultural Analytics.


Underwood would not be able to ask large-scale questions about literary history over a broad timeline without machine learning and access to a large digital library.

“Machine learning allows us to pose questions about concepts, like gender, that lack a clear definition,” he said. “Models using evidence from different historical periods can learn to define masculinity or femininity differently.

“The HathiTrust Digital Library is a great resource. We wouldn’t have been able to say anything much after 1923 without HathiTrust sharing information from those volumes, because they are under copyright.”

The researchers have shared the dataset they used, and Underwood hopes others will use it to pose new questions about the history of gender in fiction.


About Gary Price

Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C., metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at, and is currently a contributing editor at Search Engine Land.