Digitizing books published before 1700 has created a “black-dot problem” that is both aesthetic and practical: in the transcribed texts, the word “love,” for example, can show up as “lo•e.”
Taking the digital savvy of today’s age one step further, engineering students in Northwestern University’s McCormick School of Engineering and Applied Science have come to the rescue of the marred and sometimes indecipherable words that populate the transcribed versions of these early English texts.
Working with undergraduates from the Weinberg College of Arts and Sciences, the engineering students designed a computer program that uses language modeling, the same kind of technique found in autocorrect and voice-recognition programs, to help fill in the blanks in the incomplete words.
Since 1999, the nonprofit Text Creation Partnership has transcribed about 50,000 texts, but the transcriptions contain roughly 5 million incomplete words. Many of the books were tattered, and poor-quality scans compromised the transcriptions further.
Language modeling finds misspellings and “black-dot words,” which are created when the computer encounters an unknown character. Once an error is found, the program evaluates nearby characters and suggests replacements, assigning each option a probability based on the context.
The word “lo•e” might be “love,” but it also might be “lone,” “lore,” or “lose.” A language model uses context to choose the most likely option. If the context is “she was in lo•e with him,” then the program concludes that the intended word is, indeed, “love.”
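The article does not describe the students' program in detail, so the following is only a minimal sketch of the general idea, assuming a simple bigram language model with add-one smoothing. The tiny corpus and the function names (`candidates`, `bigram_prob`, `rank`) are invented for illustration and are not taken from the actual project.

```python
import re
from collections import Counter

# Toy training text, invented for this illustration. A real system would
# learn its statistics from a large body of early English texts.
CORPUS = (
    "she was in love with him . he was in love with her . "
    "the lone rider came at dusk . they lose the game . old lore of the sea ."
)

def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

TOKENS = tokenize(CORPUS)
UNIGRAMS = Counter(TOKENS)                  # how often each word occurs
BIGRAMS = Counter(zip(TOKENS, TOKENS[1:]))  # how often each adjacent pair occurs
VOCAB = set(TOKENS)

def candidates(damaged, vocab=VOCAB, gap="•"):
    """Return known words that fit the damaged word, one letter per black dot."""
    pattern = re.compile(
        "".join("[a-z]" if ch == gap else re.escape(ch) for ch in damaged)
    )
    return sorted(w for w in vocab if pattern.fullmatch(w))

def bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one smoothing."""
    return (BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(VOCAB))

def rank(prev, damaged, nxt):
    """Score each candidate by how well it fits between its two neighbors."""
    scores = {
        w: bigram_prob(prev, w) * bigram_prob(w, nxt)
        for w in candidates(damaged)
    }
    total = sum(scores.values())
    if total == 0:
        return []  # no known word fits the pattern
    # Normalize so the candidate probabilities sum to 1, best candidate first.
    return sorted(((w, s / total) for w, s in scores.items()),
                  key=lambda pair: -pair[1])

if __name__ == "__main__":
    # Context: "she was in lo•e with him"
    for word, prob in rank("in", "lo•e", "with"):
        print(f"{word}: {prob:.2f}")
```

With this toy corpus, “love” ranks first for the context “in lo•e with” because “in love” and “love with” are the only candidate pairings the model has seen. A production system would draw on far more text and wider context than a single neighboring word on each side.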