August 15, 2020

Machine Learning: The Library of Congress “Newspaper Navigator” Dataset is Now Available; Over 16 Million Pages From “Chronicling America” Processed

About the Newspaper Navigator Dataset (via Github)

The goal of Newspaper Navigator is to re-imagine searching over Chronicling America. The project consists of two stages:

  • Creating the Newspaper Navigator dataset by extracting headlines, photographs, illustrations, maps, comics, cartoons, and advertisements from 16.3 million historic newspaper pages in Chronicling America using emerging machine learning techniques. In addition to the visual content, the dataset will include captions and other relevant text derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying. 
  • Reimagining an exploratory search interface over the Newspaper Navigator dataset in order to enable new ways for the American public to navigate Chronicling America.
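To illustrate what "image embeddings for fast similarity querying" can mean in practice, here is a minimal sketch, not the project's actual code. It assumes each extracted image is represented by a numeric vector and ranks images by cosine similarity to a query vector. The names `cosine_similarity`, `most_similar`, and the toy three-dimensional vectors are hypothetical; the real dataset's embedding format and dimensions may differ.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real image embeddings have many more dimensions.
embeddings = {
    "photo_a": [0.9, 0.1, 0.0],
    "photo_b": [0.8, 0.2, 0.1],
    "map_c": [0.0, 0.1, 0.9],
}

results = most_similar([1.0, 0.0, 0.0], embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

This brute-force scan compares the query against every vector, which is fine for a small sample. At the scale of millions of images, an approximate nearest-neighbor index would likely be needed instead.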

Access the Dataset

Note: The “exploratory search interface” is coming Summer 2020.

Background About the Project 1 (April 21, 2020)

Library of Congress Innovator in Residence Ben Lee Discusses His Newspaper Navigator Project That Uses Machine Learning to Extract Visual Content From Chronicling America… (April 20, 2020)

Looking through the historic newspapers in Chronicling America, I’m always struck by how visually rich the pages are: beautiful illustrations, fascinating political cartoons, and prominent headlines abound. During my time as an Innovator in Residence at the Library of Congress, I’m developing a project called Newspaper Navigator, the goal of which is to re-imagine how we can explore wonderfully rich visual content in Chronicling America.

[Clip]

The 16+ million historical newspapers within Chronicling America are fascinating to me on so many levels. They are a portal back in time and reveal the rich history of the United States in a way that is unique to historic newspapers, from local histories to fun advertisements. But what excites me most about Chronicling America is how it reaches such a wide range of the American public, including school groups, genealogists, journalists, local historians, researchers, and even people looking to recreate old cooking recipes!

Read the Complete Blog Post

Background About the Project 2 (May 4, 2020) (via arXiv)

The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America
14 pages; PDF.

Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress’s Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast

Background About the Project 3 (May 8, 2020) (via TechCrunch)

Millions of Historic Newspaper Images Get the Machine Learning Treatment At The Library Of Congress (via TechCrunch)

Using the initial human-powered work of outlining images and captions as training data, they built an AI agent that could do so on its own. After the usual tweaking and optimizing, they set it loose on the full Chronicling America database of newspaper scans.

“It ran for 19 days nonstop — definitely the largest computing job I’ve ever run,” said [Ben] Lee. But the results are remarkable: millions of images spanning three centuries (from 1789 to 1963) and organized with metadata pulled from their own captions.

Read the Complete Article

Direct to Newspaper Navigator Web Page (via LC Labs)

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009, he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.