From the University of Illinois:
Professor Ted Underwood has received a $73,122 grant from the National Endowment for the Humanities to investigate the consequences of error in digital libraries. While digital libraries represent an immense storehouse of knowledge, their texts are full of errors because optical transcription is an imperfect process.
“It isn’t unusual for five percent of the words in volumes to be mistranscribed, with the level of error much higher in some volumes,” said Underwood. “Simply measuring the fraction of mistranscribed words is easy. It’s harder to know how much difference those errors make for the methods and questions that actually interest researchers. Some forms of analysis are undisturbed by high levels of error; others may be quite sensitive, especially when errors are distributed unevenly across different historical periods and genres.”
Underwood will work with graduate students from the iSchool and English Department to construct parallel collections that pair each “clean” text with a realistically error-ridden version of the same book drawn from a digital library. The team will build collections of Chinese texts as well as English texts ranging from 1700 to the present, because different character sets and printing technologies produce different kinds of error. Then the team will apply a wide range of data-mining methods to both the clean and error-ridden collections and measure the distortion produced by transcription error and other common sources of noise. The project will provide tools that help other researchers estimate the level of uncertainty in their own conclusions.
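As a rough illustration of the comparison described above, the following Python sketch pairs a "clean" passage with an error-ridden version, estimates the fraction of mistranscribed words, and then checks how much one simple statistic (the relative frequency of a word) shifts between the two. It is not the project's method: the sample sentences, the function names, and the choice of word frequency as the measure are all assumptions made for illustration, and the alignment relies on Python's standard `difflib` module.

```python
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def mistranscribed_fraction(clean, noisy):
    """Estimate the share of clean-text words with no aligned match in the noisy text."""
    a, b = tokenize(clean), tokenize(noisy)
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(a)


def relative_frequency(text, word):
    """Occurrences of `word` divided by the total number of tokens."""
    tokens = tokenize(text)
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


# Invented sample pair: the noisy version imitates a long "s" being read as "f".
clean = "the ship sailed past the fortress at sunrise"
noisy = "the fhip failed past the fortrefs at sunrise"

error_rate = mistranscribed_fraction(clean, noisy)
distortion = relative_frequency(noisy, "ship") - relative_frequency(clean, "ship")

print(f"mistranscribed fraction: {error_rate:.3f}")
print(f"shift in relative frequency of 'ship': {distortion:+.3f}")
```

In this toy pair, three of eight words are mistranscribed and the word "ship" disappears entirely from the noisy version, which shows how a fairly modest error rate can still distort a measure that depends on specific words.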