Historians and other humanities scholars often have to deal with difficult research objects: centuries-old printed works that are hard to decipher and often in a poor state of conservation. Many of these documents have now been digitized – usually photographed or scanned – and are available online worldwide. For research purposes, this is already a step forward.
However, one challenge remains: using text recognition software to convert the old typefaces in these digitized works into a modern form that is readable for non-specialists as well as for computers. Scientists at the Center for Philology and Digitality at Julius-Maximilians-Universität Würzburg (JMU) in Bavaria, Germany, have made a significant contribution to progress in this field.
Page from a French version of the “Narrenschiff”. Such old fonts can be reliably converted into computer-readable text with OCR4all. (Image: Staats- und Universitätsbibliothek Dresden, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/deed.de)
With OCR4all, the JMU research team is making a new tool available to the scientific community. It converts digitized historical prints into computer-readable text with an error rate of less than one percent, and it offers a graphical user interface that requires no IT expertise. Earlier tools of this kind were often not user-friendly, because users mostly had to work with programming commands.
[Clip]
The new OCR4all tool was developed under the direction of Christian Reul together with his computer science colleagues Professor Frank Puppe (Chair of Artificial Intelligence and Applied Computer Science) and Christoph Wick, as well as Uwe Springmann (Digital Humanities expert) and numerous students and assistants.
[Clip]
Christian Reul explains the challenges involved in the development of OCR4all: automatic text recognition (OCR, short for optical character recognition) has worked very well for modern fonts for some time. However, this has not yet been the case for historical fonts.
“One of the biggest problems was typography,” says Reul. One of the reasons for this is that the first printers of the 15th century did not use uniform fonts. “Their printing stamps were all carved by themselves, each printing house practically had its own letters.”
Error rates below one percent
Whether e or c, whether v or r – it is often not easy to distinguish in old prints, but software can learn to recognize such subtleties. To do so, it has to be trained on sample material. In his work, Reul has developed methods to make training more efficient. In a case study with six historical prints from the years 1476 to 1572, the average error rate in automatic text recognition was reduced from 3.9 to 1.7 percent.
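To make the error-rate figures concrete, here is a minimal Python sketch of how such a rate could be computed. It assumes the figure is a character error rate, meaning the edit distance between the OCR output and a hand-checked transcription divided by the length of that transcription. The article does not say how the team measured its error rates, so the metric, the function names and the sample strings are illustrative assumptions, not the OCR4all implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Assumed metric: edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


def relative_reduction(before: float, after: float) -> float:
    """Share of the original error rate that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical sample: the OCR engine misreads one 'e' as 'c'.
    cer = character_error_rate("narrenschiff", "narrcnschiff")
    print(f"character error rate: {cer:.1%}")

    # Figures from the case study: 3.9 percent reduced to 1.7 percent.
    print(f"relative reduction: {relative_reduction(3.9, 1.7):.1%}")
```

Under this assumption, the drop from 3.9 to 1.7 percent in the case study means that a little more than half of the remaining recognition errors were removed.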