September 23, 2020

Library of Congress Labs Launches New Tool to Search Visual Content in Historical Newspapers

From LC:

The public can now explore more than 1.5 million historical newspaper images online and free of charge. The latest machine learning experience from LC Labs, Newspaper Navigator, allows users to search visual content in American newspapers dating from 1789 to 1963.

How it Works

The user begins by entering a keyword that returns a selection of photos. Then the user can choose photos to search against, allowing the discovery of related images that were previously undetectable by search engines.
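To make the "choose photos to search against" step concrete, here is a minimal, generic sketch of similarity search in Python. It is not Newspaper Navigator's actual implementation. It assumes each image has already been reduced to a numeric feature vector (an "embedding"), and the image IDs and three-number vectors below are hypothetical placeholders invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Average several vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def rank_similar(selected_ids, embeddings, top_k=3):
    """Rank the images the user did not select by similarity to the selection."""
    query = mean_vector([embeddings[image_id] for image_id in selected_ids])
    scored = [
        (cosine_similarity(query, vector), image_id)
        for image_id, vector in embeddings.items()
        if image_id not in selected_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:top_k]]


# Hypothetical toy data: real embeddings would come from a trained image
# model and have hundreds of dimensions, not three.
EMBEDDINGS = {
    "car_ad_1912": [0.9, 0.1, 0.0],
    "car_photo_1915": [0.8, 0.2, 0.1],
    "truck_photo_1920": [0.7, 0.3, 0.0],
    "political_cartoon_1904": [0.1, 0.9, 0.2],
    "world_map_1898": [0.0, 0.1, 0.9],
}

print(rank_similar(["car_ad_1912"], EMBEDDINGS, top_k=2))
# prints ['car_photo_1915', 'truck_photo_1920']
```

In this sketch, selecting more than one photo simply averages their vectors into a single query, which is one simple design choice among several. A production system could instead train a small classifier on the user's selections.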

For decades, partners across the United States have collaborated to digitize newspapers through the Library’s Chronicling America website, a database of historical U.S. newspapers. The text of the newspapers is made searchable by character recognition technology, but users looking for specific images were required to page through the individual issues.

[Image: A Search Result For “Automobile”]

The Developer

Through the creative ingenuity of Innovator in Residence Benjamin Lee and advances in machine learning, Newspaper Navigator now makes images in the newspapers searchable by enabling users to search by visual similarity. To create Newspaper Navigator, Lee trained computer algorithms to sort through 16 million Chronicling America newspaper pages in search of photographs, illustrations, maps, cartoons, comics, headlines and advertisements.

The idea for Lee’s groundbreaking project began with a Library crowdsourcing experiment by 2017 Innovator in Residence Tong Wang called Beyond Words, which invited members of the public to help identify cartoons, illustrations, photographs and advertisements in World War I-era newspapers. Users could draw boxes around visual content on a page, transcribe captions or review other users’ transcriptions.

[Clip]

Dataset Code

While image search techniques from tech companies are not new, Newspaper Navigator marries cultural heritage with computer science. Users encounter a real-time demonstration of how algorithms are trained to scan millions of pieces of data in seconds. All code used in the project is open source and placed in the public domain for unrestricted re-use. The dataset code can be accessed at github.com/LibraryOfCongress/newspaper-navigator.

Direct to Newspaper Navigator

Learn More, Read the Complete Announcement

Background

Machine Learning: The Library of Congress “Newspaper Navigator” Dataset is Now Available; Over 16 Million Pages From “Chronicling America” Processed (May 2020)

Library of Congress Innovator in Residence Ben Lee Discusses His Newspaper Navigator Project That Uses Machine Learning to Extract Visual Content From Chronicling America & Announces Upcoming “Data Jam” to Preview Dataset (April 2020)

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009, he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.
