
May 7, 2020 by Gary Price

Machine Learning: The Library of Congress “Newspaper Navigator” Dataset is Now Available; Over 16 Million Pages From “Chronicling America” Processed

About the Newspaper Navigator Dataset (via GitHub)

The goal of Newspaper Navigator is to re-imagine searching over Chronicling America. The project consists of two stages:

  • Creating the Newspaper Navigator dataset by extracting headlines, photographs, illustrations, maps, comics, cartoons, and advertisements from 16.3 million historic newspaper pages in Chronicling America using emerging machine learning techniques. In addition to the visual content, the dataset will include captions and other relevant text derived from the METS/ALTO OCR, as well as image embeddings for fast similarity querying. 
  • Reimagining an exploratory search interface over the Newspaper Navigator dataset in order to enable new ways for the American public to navigate Chronicling America.
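
As a rough illustration of what "image embeddings for fast similarity querying" can mean in practice, the sketch below ranks a handful of embedding vectors by cosine similarity to a query vector. The vectors, identifiers, and function names are made up for illustration and are not taken from the Newspaper Navigator code or dataset; the project's actual embedding model and storage format are documented in its GitHub repository.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=3):
    """Return the k (item_id, score) pairs most similar to the query."""
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings; real image embeddings are much longer.
EMBEDDINGS = {
    "map_001": [0.9, 0.1, 0.0],
    "map_002": [0.8, 0.2, 0.1],
    "cartoon_001": [0.0, 0.9, 0.4],
    "photo_001": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    for item_id, score in most_similar([1.0, 0.0, 0.0], EMBEDDINGS, k=2):
        print(f"{item_id}: {score:.3f}")
```

A linear scan like this is fine for a toy example. At the scale of millions of images, an approximate nearest-neighbor index would likely be needed, which is a separate design choice not covered here.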

Access the Dataset

Note: The “exploratory search interface” is coming Summer 2020.

Background About the Project 1 (April 21, 2020)

Library of Congress Innovator in Residence Ben Lee Discusses His Newspaper Navigator Project That Uses Machine Learning to Extract Visual Content From Chronicling America… (April 20, 2020)

Looking through the historic newspapers in Chronicling America, I’m always struck by how visually rich the pages are: beautiful illustrations, fascinating political cartoons, and prominent headlines abound. During my time as an Innovator in Residence at the Library of Congress, I’m developing a project called Newspaper Navigator, the goal of which is to re-imagine how we can explore wonderfully rich visual content in Chronicling America.

[Clip]

The 16+ million historical newspapers within Chronicling America are fascinating to me on so many levels. They are a portal back in time and reveal the rich history of the United States in a way that is unique to historic newspapers, from local histories to fun advertisements. But what excites me most about Chronicling America is how it reaches such a wide range of the American public, including school groups, genealogists, journalists, local historians, researchers, and even people looking to recreate old cooking recipes!

Read the Complete Blog Post

Background About the Project 2 (May 4, 2020) (via arXiv)

The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America
14 pages; PDF.

Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress’s Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast [Clip]
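
To make the pipeline described in the abstract more concrete, here is a minimal sketch of one plausible post-processing step: keeping confident detections from the seven classes and gathering the OCR words whose positions fall inside each predicted bounding box. The data structures, the 0.5 confidence threshold, and the center-point containment rule are all assumptions made for illustration; the paper and the project's GitHub repository describe the actual implementation.

```python
from dataclasses import dataclass

# The seven classes named in the abstract.
CLASSES = ("headline", "photograph", "illustration", "map",
           "comics", "editorial cartoon", "advertisement")


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def contains_point(self, x, y):
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


@dataclass
class Detection:
    label: str
    score: float
    box: Box


@dataclass
class OcrWord:
    text: str
    box: Box


def text_inside(detection, words):
    """Join OCR words whose box center lies inside the detection box."""
    inside = []
    for word in words:
        cx = (word.box.x1 + word.box.x2) / 2
        cy = (word.box.y1 + word.box.y2) / 2
        if detection.box.contains_point(cx, cy):
            inside.append(word.text)
    return " ".join(inside)


def extract(detections, words, min_score=0.5):
    """Keep confident detections of known classes and attach their text."""
    results = []
    for det in detections:
        if det.label in CLASSES and det.score >= min_score:
            results.append({"label": det.label,
                            "score": det.score,
                            "text": text_inside(det, words)})
    return results


# Hypothetical page: one confident headline, one low-confidence map.
DETECTIONS = [
    Detection("headline", 0.97, Box(0, 0, 100, 20)),
    Detection("map", 0.30, Box(0, 30, 50, 80)),
]
WORDS = [
    OcrWord("PEACE", Box(5, 5, 40, 15)),
    OcrWord("DECLARED", Box(45, 5, 95, 15)),
    OcrWord("weather", Box(60, 40, 90, 50)),
]

if __name__ == "__main__":
    print(extract(DETECTIONS, WORDS))
```

With the sample data, only the headline passes the threshold, and its text is assembled from the two OCR words that sit inside its box.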

Background About the Project 3 (May 8, 2020) (via TechCrunch)

Millions of Historic Newspaper Images Get the Machine Learning Treatment At The Library Of Congress (via TechCrunch)

Using the initial human-powered work of outlining images and captions as training data, they built an AI agent that could do so on its own. After the usual tweaking and optimizing, they set it loose on the full Chronicling America database of newspaper scans.

“It ran for 19 days nonstop — definitely the largest computing job I’ve ever run,” said [Ben] Lee. But the results are remarkable: millions of images spanning three centuries (from 1789 to 1963) and organized with metadata pulled from their own captions.

Read the Complete Article

Direct to Newspaper Navigator Web Page (via LC Labs)

Filed under: Data Files, Libraries, Maps, News, Patrons and Users

About Gary Price

Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.

© 2023 Library Journal. All rights reserved.

