Kenton McHenry, Luigi Marini, Mayank Kejriwal, Rob Kooper
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
National Institute of Standards and Technology
From the SPIE Newsroom:
In summary, our hybrid automation/crowd-sourcing approach aims to provide search capabilities over the image-based census data, potentially from the day the images are released. The general difficulty of automating handwriting recognition will limit its accuracy, but incorporating passive and active crowd-sourcing elements will improve the accuracy of our systems over time. We are currently working on a number of challenges, including further pre-processing of form cells to remove noise. Our next important stage will be to build an index of the ∼7 billion form cells, which is crucial for efficient access. However, of the word-spotting techniques we tested, the one that gives the best results uses a non-linear comparison that does not lend itself to indexing. We are therefore investigating alternative methods that are indexable, as well as the use of high-performance computing resources to perform a one-time, large pre-processing step that hierarchically clusters the data (requiring 4.9×10¹⁹ comparisons). Finally, we will investigate how best to associate the passively crowd-sourced transcriptions with the results based on user behavior.
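The comparison count quoted for the clustering step is consistent with comparing every form cell against every other cell. The short sketch below reproduces that arithmetic under the assumption of exactly 7 billion cells (the text gives only an approximate figure); the function names are illustrative and are not part of our system.

```python
# Back-of-the-envelope check of the all-pairs comparison count.
# Assumption: exactly 7 billion form cells (the article says "about 7 billion").

def ordered_pair_comparisons(n_cells: int) -> int:
    """Comparisons if every cell is compared against every cell (n * n)."""
    return n_cells * n_cells


def unique_pair_comparisons(n_cells: int) -> int:
    """Comparisons if each unordered pair is compared only once (n * (n - 1) / 2)."""
    return n_cells * (n_cells - 1) // 2


if __name__ == "__main__":
    n = 7_000_000_000  # assumed cell count
    print(f"n^2        = {ordered_pair_comparisons(n):.1e}")
    print(f"n(n-1)/2   = {unique_pair_comparisons(n):.2e}")
```

Under this assumption, n² matches the quoted order of magnitude, and exploiting the symmetry of the comparison would roughly halve it. Either way the cost is quadratic in the number of cells, which is why the step is treated as a one-time computation on high-performance resources.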