January 16, 2022

Topic Modeling: Avalanches of Words, Sifted and Sorted

From The NY Times:

David M. Blei of Princeton University is among those who are teaching computers to sift through the digital pages of books and articles and categorize the contents by subject, even when that subject isn’t stated explicitly.

For decades, of course, librarians and many others have labeled books and documents with keywords. “But human categorization can only go so far,” said Dr. Blei, an associate professor in computer science. “We don’t have the human power to read and tag all this information.”

To cope with the information explosion, Dr. Blei and other researchers write algorithms so that computers can sift through millions of works and find their common themes by sorting related words into categories. It’s a field called probabilistic topic modeling.


The Bookworm-arXiv interface is the latest in a series of tools developed by the Cultural Observatory. Late in 2010, in collaboration with Google, the lab released the Google n-gram viewer, which lets people search for a phrase of up to five words in Google’s database of scanned books and see the frequency of the words over time in a graph, Dr. Aiden said.

Project Discussed:

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at Ask.com, and is currently a contributing editor at Search Engine Land.