Northwestern University Engineering Students Fix Common Glitch in Digitization of Books Published Before 1700
Digitizing books published before 1700 has created a “black-dot problem” that is both aesthetic and practical: in the transcribed texts, the word “love,” for example, can show up as “lo•e.”
Taking the digital savvy of today’s age one step further, Northwestern University engineering students in the McCormick School of Engineering and Applied Science have come to the rescue of the marred and sometimes indecipherable words that populate the transcribed versions of the early English texts.
Working in conjunction with undergraduates from the Weinberg College of Arts and Sciences, the engineering students designed a computer program that uses language modeling, akin to autocorrect and voice-recognition programs, to help fill in the blanks of the incomplete words.
Since 1999, about 50,000 texts have been transcribed by the non-profit Text Creation Partnership, but the works contain roughly 5 million incomplete words. The transcriptions of the tattered books were further compromised by poor-quality scans.
Language modeling finds misspellings and “black-dot words” created when the computer encounters an unknown character. Once an error is found, nearby characters are evaluated and replacement suggestions are made, with a probability assigned to each option based on the context.
The word “lo•e” might be “love,” but it also might be “lone,” “lore,” or “lose.” A language model uses context to choose the correct option. If the context is “she was in lo•e with him,” then the program assumes the incomplete word is, indeed, “love.”
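To make the idea concrete, here is a minimal sketch in Python of the two steps described above: generating replacement candidates for a black-dot word, then ranking them by context. It is not the Northwestern students' program. The tiny lexicon, the toy corpus, the function names, and the add-one smoothed bigram scoring are all assumptions chosen for illustration, and the sketch only handles lowercase a-z.

```python
import re
from collections import Counter

BLACKDOT = "•"

# Hypothetical toy lexicon and corpus; a real system would use far larger ones.
LEXICON = {"love", "lone", "lore", "lose", "she", "was", "in", "with", "him"}
CORPUS = ("she was in love with him . he was in love with her . "
          "a lone rider . they lose with grace").split()


def candidates(word, lexicon):
    """Return lexicon words matching `word`, treating each black dot as one unknown letter."""
    pattern = re.compile(
        "".join("[a-z]" if ch == BLACKDOT else re.escape(ch) for ch in word)
    )
    return sorted(w for w in lexicon if pattern.fullmatch(w))


def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rank(prev_word, word, next_word, lexicon, bigrams):
    """Rank candidates by add-one smoothed bigram counts with both neighbours.

    Scores are normalised so they sum to 1 across the candidates.
    """
    scores = {}
    for cand in candidates(word, lexicon):
        left = bigrams[(prev_word, cand)] + 1   # count of "prev cand", smoothed
        right = bigrams[(cand, next_word)] + 1  # count of "cand next", smoothed
        scores[cand] = left * right
    total = sum(scores.values())
    return sorted(((c, s / total) for c, s in scores.items()), key=lambda x: -x[1])


BIGRAMS = bigram_counts(CORPUS)
ranked = rank("in", "lo•e", "with", LEXICON, BIGRAMS)
```

With this toy corpus, `ranked[0][0]` is `"love"` for the context “in lo•e with,” because “in love” and “love with” both occur in the corpus while the other candidates rarely or never appear next to those words.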
About Gary Price
Gary Price (email@example.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C., metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.