May 28, 2022

By Text-Mining the Classics, University of Nebraska-Lincoln Prof Unearths New Literary Insights

From a UNL News Blog Post:

Mark Twain once said that all ideas are second hand, consciously and unconsciously drawn from a million outside sources. Oscar Wilde put it more bluntly when he said that talents imitate, but geniuses steal.

Matthew Jockers has assembled a way to quantify the spirit of those sayings, particularly when it comes to certain authors and the impressions they left on other writers. And in doing so, he’s opened a new door for literary theorists to study classic literature.

Jockers, an assistant professor of English at the University of Nebraska-Lincoln, combines programming with text-mining to compare 18th- and 19th-century authors’ works with one another based on their stylistic and thematic connections. The process, which he calls macroanalysis, crunches massive amounts of text to discern systematically how books are connected to one another – from each work’s word frequency and word choice to its overarching subject matter.


Using macroanalysis, Jockers processed digital versions of nearly 3,500 books from the late 1700s through 1900 – everything from giants like Jane Austen and Herman Melville to lesser-known writers such as Scottish novelist Margaret Oliphant. The process affixed each book with its own unique “signal,” allowing it to be plotted graphically near other books that it was closely related to, but farther away from books exhibiting more dissimilar styles and themes.
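To make the idea of a per-book “signal” concrete, here is a minimal Python sketch. It is not Jockers’ actual method, which the post does not detail. It assumes the simplest possible signal, the relative frequency of each word, and compares books by cosine similarity. The function names and the three-snippet toy corpus are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each lowercase word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(title, corpus):
    """Return the title in the corpus whose signal is closest to the given one."""
    target = word_frequencies(corpus[title])
    scores = {
        other: cosine_similarity(target, word_frequencies(text))
        for other, text in corpus.items()
        if other != title
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus: invented snippets, not text from any real book.
corpus = {
    "sea_tale_1": "the whale and the ship sailed the sea",
    "sea_tale_2": "the ship chased the whale across the sea",
    "parlour_tale": "she danced at the ball and hoped to marry",
}

print(most_similar("sea_tale_1", corpus))
```

In this sketch, books with similar vocabulary score closer to 1 and would sit near each other on a plot, while books with little shared vocabulary score closer to 0. A study at the scale described here would need far richer features, such as the thematic measures the post mentions, than raw word counts.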

The result was a stunning graphical distribution [see the complete blog post] that displays connections, insights and trends both obvious and perhaps not so obvious about the period’s literary world. The systematic method found that, unsurprisingly, the books of Austen and Sir Walter Scott were highly original and influential; and that Melville’s “Moby Dick” was an outlier from much of the literary network of the period while still being related to several works by James Fenimore Cooper.


“The canonical greats are not necessarily outliers, often they’re similar to the many orphans of literary history that have long been forgotten in a continuum of stylistic and thematic change,” he said.

“Macroanalysis provides one method for studying the orphans and the classics side by side – a way of sifting through the haystack of literary history, of isolating and then studying the canonical greats within the larger population of less familiar titles.”

Read the Complete Post and View Several Data Visualizations

About Gary Price

Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at, and is currently a contributing editor at Search Engine Land.