January 21, 2021

Northwestern University Engineering Students Fix Common Glitch in Digitization of Books Published Before 1700

From Northwestern U. News:

Digitizing books published before 1700 has created an aesthetic as well as quite pragmatic “black-dot problem” in translated texts, with the word “love,” for example, showing up as “lo•e.”

Taking the digital savvy of today’s age one step farther, Northwestern University engineering students in the McCormick School of Engineering and Applied Sciences have come to the rescue of the marred and sometimes indecipherable words that populate the translated versions of the early English texts.

Working in conjunction with undergraduates from the Weinberg College of Arts and Sciences, the engineering students designed a computer program that uses language modeling, akin to autocorrect and voice-recognition programs, to help fill in the blanks of the incomplete words.

[Clip]

Since 1999, the non-profit Text Creation Partnership has transcribed about 50,000 texts, but the works contain roughly 5 million incomplete words. The transcriptions of the tattered books were further compromised by poor-quality scans.

Language modeling finds misspellings and “black-dot words” created when the computer encounters an unknown character. Once an error is found, nearby characters are evaluated and replacement suggestions are made, with a probability assigned to each option based on the context.
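
The suggestion step can be sketched in a few lines of Python. This is only an illustration, not the students' program: the vocabulary `VOCAB`, the function name `candidates`, and the choice to treat each black dot as exactly one missing letter are all assumptions made for the example.

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would draw on a large lexicon of early modern English spellings.
VOCAB = {"love", "lone", "lore", "lose", "live", "she", "was", "in", "with", "him"}

def candidates(damaged, vocab=VOCAB, gap="•"):
    """Return vocabulary words that fit the damaged word.

    Each gap character is treated as exactly one unknown lowercase letter.
    """
    pattern = "".join(
        "[a-z]" if ch == gap else re.escape(ch) for ch in damaged.lower()
    )
    regex = re.compile(pattern)
    return sorted(word for word in vocab if regex.fullmatch(word))

print(candidates("lo•e"))
```

Matching against a word list keeps the sketch short; it does not cover misspellings or dots that hide more than one character.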

The word “lo•e” might be “love,” but it also might be “lone,” “lore,” or “lose.” A language model uses context to choose the correct option. If the context is “she was in lo•e with him,” then the program assumes the missing word is, indeed, “love.”
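
As a rough sketch of how context can pick among such options, the Python below scores each candidate by how often it appears next to its neighbouring words and normalizes the scores into probabilities. The bigram counts are invented purely for illustration, and the names `BIGRAMS`, `score`, and `rank` are hypothetical; the article does not say which language model the students actually used.

```python
from collections import Counter

# Hypothetical bigram counts, invented for illustration only. A real model
# would estimate these from a large corpus of clean transcriptions.
BIGRAMS = Counter({
    ("in", "love"): 50, ("love", "with"): 40,
    ("in", "lone"): 1,
    ("in", "lore"): 2,
    ("lose", "with"): 1,
})

def score(prev_word, candidate, next_word, counts=BIGRAMS):
    """Add-one smoothed product of the left and right bigram counts."""
    left = counts[(prev_word, candidate)] + 1
    right = counts[(candidate, next_word)] + 1
    return left * right

def rank(prev_word, options, next_word):
    """Return (word, probability) pairs, most likely first."""
    scores = {word: score(prev_word, word, next_word) for word in options}
    total = sum(scores.values())
    return sorted(
        ((word, s / total) for word, s in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Context from the article: "she was in lo•e with him"
ranking = rank("in", ["love", "lone", "lore", "lose"], "with")
best_word = ranking[0][0]
print(best_word)
```

With these made-up counts, “love” comes out on top because it is the only candidate seen often after “in” and before “with”. The add-one smoothing keeps unseen word pairs from receiving a score of zero.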

Read the Complete Article

Even More in: “Dirty Words” via McCormick School of Engineering Magazine

See Also: ProQuest, University of Michigan Library and Bodleian Libraries Provide 25,000 Early Modern Books as Open Access Text (January 27, 2015)

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.
