December 2, 2020

Tech Article: "Toward free and searchable historical census images"

By:
Kenton McHenry, Luigi Marini, Mayank Kejriwal, Rob Kooper
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Urbana, IL

and

Peter Bajcsy
National Institute of Standards and Technology
Gaithersburg, MD

From the SPIE Newsroom:

In summary, our hybrid automation/crowd-sourcing approach aims to provide search capabilities over the image-based census data, potentially from the day the images are released. However, general difficulties in automating handwriting recognition will limit its accuracy. Incorporation of passive and active crowd-sourcing elements will improve the accuracy of our systems over time. We are currently working on a number of challenges, including further pre-processing of form cells to remove noise. Our next important stage will be to build an index of the ∼7 billion form cells, which is crucial for efficient access. However, of the word-spotting techniques we tested, the best results use a non-linear comparison that does not lend itself to indexing. We are currently investigating alternative methods that are indexable, as well as using high-performance computing resources to perform a one-time, large pre-processing step to hierarchically cluster the data (requiring 4.9×10^19 comparisons). Finally, we will investigate how best to associate the passively crowd-sourced transcriptions with the results based on user behavior.
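
To make the scale in the excerpt concrete, here is a rough Python sketch under stated assumptions. The quoted comparison count is consistent with comparing each of roughly 7 billion cells against every cell (7 billion squared, about 4.9×10^19). The excerpt does not say which non-linear comparison the authors used; dynamic time warping (DTW) appears below only as a familiar example of such a comparison, and the function names and the one-dimensional feature sequences are hypothetical.

```python
from typing import Sequence


def all_pairs_comparisons(n_cells: int) -> int:
    """Comparisons needed if every cell is compared against every cell (n * n)."""
    return n_cells * n_cells


def unordered_pairs(n_cells: int) -> int:
    """Comparisons needed if each distinct pair is compared only once."""
    return n_cells * (n_cells - 1) // 2


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two 1-D feature sequences.

    Assumption: DTW is an illustrative stand-in for the unnamed
    "non-linear comparison" in the excerpt, not the authors' method.
    """
    inf = float("inf")
    # prev[j] holds the best alignment cost of the rows processed so far
    # against the first j elements of b.
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Allow stretching either sequence or advancing both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    n = 7_000_000_000  # roughly 7 billion form cells, per the excerpt
    print(f"all pairs:       {all_pairs_comparisons(n):.2e}")
    print(f"unordered pairs: {unordered_pairs(n):.2e}")
    # Two sequences that differ only by a repeated sample align perfectly.
    print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

A distance of this kind depends on the best alignment between two specific sequences, so it has to be evaluated pair by pair. That is one way to read the excerpt's remark that the best-performing comparison does not lend itself to indexing, and why a one-time clustering pass over all pairs is so costly.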

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.
