Machine Learning: The Library of Congress “Newspaper Navigator” Dataset is Now Available; Over 16 Million Pages From “Chronicling America” Processed
The goal of Newspaper Navigator is to re-imagine searching over Chronicling America. The project consists of two stages:
- Creating the Newspaper Navigator dataset by extracting headlines, photographs, illustrations, maps, comics, cartoons, and advertisements from 16.3 million historic newspaper pages in Chronicling America using emerging machine learning techniques. In addition to the visual content, the dataset will include captions and other relevant text derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying.
- Reimagining an exploratory search interface over the Newspaper Navigator dataset in order to enable new ways for the American public to navigate Chronicling America.
Note: The “exploratory search interface” is coming Summer 2020.
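For readers curious what "image embeddings for fast similarity querying" can look like in practice, here is a minimal sketch. It assumes each extracted image is represented by a short numeric vector and ranks items by cosine similarity. The item names, vectors, and function names are made up for illustration. They are not taken from the Newspaper Navigator dataset or its code, and the project's own similarity method may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=3):
    """Return the k (item_id, score) pairs whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]


# Hypothetical toy embeddings; real image embeddings would have many more dimensions.
toy_embeddings = {
    "map_a": [0.9, 0.1, 0.0],
    "map_b": [0.8, 0.2, 0.1],
    "cartoon_a": [0.0, 0.1, 0.9],
}

results = most_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
for item_id, score in results:
    print(item_id, round(score, 3))
```

A brute-force scan like this is fine for a handful of items. At the scale of millions of images, an approximate nearest-neighbor index would likely be needed instead.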
Background About the Project 1 (April 21, 2020)
Looking through the historic newspapers in Chronicling America, I’m always struck by how visually rich the pages are: beautiful illustrations, fascinating political cartoons, and prominent headlines abound. During my time as an Innovator in Residence at the Library of Congress, I’m developing a project called Newspaper Navigator, the goal of which is to re-imagine how we can explore wonderfully rich visual content in Chronicling America.
The 16+ million historical newspapers within Chronicling America are fascinating to me on so many levels. They are a portal back in time and reveal the rich history of the United States in a way that is unique to historic newspapers, from local histories to fun advertisements. But what excites me most about Chronicling America is how it reaches such a wide range of the American public, including school groups, genealogists, journalists, local historians, researchers, and even people looking to recreate old cooking recipes!
Read the Complete Blog Post
Background About the Project 2 (May 4, 2020) (via arXiv)
Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress’s Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying.
Background About the Project 3 (May 8, 2020) (via TechCrunch)
Using the initial human-powered work of outlining images and captions as training data, they built an AI agent that could do so on its own. After the usual tweaking and optimizing, they set it loose on the full Chronicling America database of newspaper scans.
“It ran for 19 days nonstop — definitely the largest computing job I’ve ever run,” said [Ben] Lee. But the results are remarkable: millions of images spanning three centuries (from 1789 to 1963) and organized with metadata pulled from their own captions.
Read the Complete Article
About Gary Price
Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.