Tech Article: "Toward free and searchable historical census images"
Kenton McHenry, Luigi Marini, Mayank Kejriwal, Rob Kooper
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
National Institute of Standards and Technology
From the SPIE Newsroom:
In summary, our hybrid automation/crowd-sourcing approach aims to provide search capabilities over the image-based census data, potentially from the day the images are released. However, general difficulties in automating handwriting recognition will limit its accuracy. Incorporation of passive and active crowd-sourcing elements will improve the accuracy of our systems over time. We are currently working on a number of challenges, including further pre-processing of form cells to remove noise. Our next important stage will be to build an index of the ∼7 billion form cells, which is crucial for efficient access. However, of the word-spotting techniques we tested, the best results use a non-linear comparison that does not lend itself to indexing. We are currently investigating alternative methods that are indexable, as well as using high-performance computing resources to perform a one-time, large pre-processing step to hierarchically cluster the data (requiring 4.9×1019 comparisons). Finally, we will investigate how best to associate the passively crowd-sourced transcriptions with the results based on user behavior.
About Gary Price
Gary Price (firstname.lastname@example.org) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne St. University Library and Information Science Program. From 2006-2009 he was Director of Online Information Services at Ask.com.