Topic Modeling: Avalanches of Words, Sifted and Sorted

March 25, 2012 by Gary Price

David M. Blei of Princeton University is among those who are teaching computers to sift through the digital pages of books and articles and categorize the contents by subject, even when that subject isn’t stated explicitly.
For decades, of course, librarians and many others have labeled books and documents with keywords. “But human categorization can only go so far,” said Dr. Blei, an associate professor in computer science. “We don’t have the human power to read and tag all this information.”
To cope with the information explosion, Dr. Blei and other researchers write algorithms so that computers can sift through millions of works and find their common themes by sorting related words into categories. It’s a field called probabilistic topic modeling.
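To make "sorting related words into categories" concrete, here is a minimal sketch of one common probabilistic topic model, latent Dirichlet allocation, fitted by collapsed Gibbs sampling. It is an illustration only, not the researchers' own code. The toy corpus, the function name `fit_lda`, and the parameter values (`num_topics`, `alpha`, `beta`, `iterations`) are all assumptions chosen for the example.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sampling sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # times word w is in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much the document uses it and
                # how strongly the topic favors this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus: two documents about physics, two about libraries.
docs = [
    "quark particle energy collider particle".split(),
    "energy quark collider physics energy".split(),
    "library catalog librarian book catalog".split(),
    "book librarian library archive book".split(),
]

doc_topic, topic_word = fit_lda(docs)
for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top}")
```

No topic labels are supplied. The sampler only sees which words appear together in the same documents, which is how a subject can surface even when it is never stated explicitly. Results on a corpus this small vary with the seed.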
[Clip]
The Bookworm-arXiv interface is the latest in a series of tools developed by the Cultural Observatory. Late in 2010, in collaboration with Google, the lab released the Google n-gram viewer, which lets people search for a phrase of up to five words in Google’s database of scanned books and see the frequency of the words over time in a graph, Dr. Aiden said.
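The idea of charting a phrase's frequency over time can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of how Google's viewer is built. The corpus, the function name `ngram_frequency_by_year`, and the whitespace tokenization are all hypothetical.

```python
from collections import defaultdict


def ngram_frequency_by_year(corpus, phrase):
    """Return {year: share of n-word windows in that year equal to phrase}.

    corpus is an iterable of (year, text) pairs. Tokenization here is a
    plain lowercase whitespace split, an assumption made for brevity.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    if not 1 <= n <= 5:
        raise ValueError("phrase must have between one and five words")

    hits = defaultdict(int)     # matching windows per year
    totals = defaultdict(int)   # all n-word windows per year
    for year, text in corpus:
        tokens = text.lower().split()
        windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        hits[year] += sum(1 for w in windows if w == target)
        totals[year] += len(windows)

    # Skip years that have no window of this length at all.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical miniature corpus of (publication year, text) pairs.
corpus = [
    (1990, "the topic model finds a topic"),
    (2000, "a topic model is a model"),
    (2000, "topic model topic model"),
]

freq = ngram_frequency_by_year(corpus, "topic model")
for year, share in freq.items():
    print(year, share)
```

Dividing by the number of windows per year, rather than reporting raw counts, keeps years with more scanned text from dominating the graph.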

Project Discussed:

The Cultural Observatory’s Bookworm (Data via arXiv)

Filed under: Data Files, News

About Gary Price

Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.
