April 23, 2019 by Gary Price

New Resource: OCR4all (Open Source Text Recognition Software for Historical Texts)

From the University of Würzburg:

Historians and other humanities scholars often have to deal with difficult research objects: centuries-old printed works that are hard to decipher and often in a poor state of conservation. Many of these documents have now been digitized – usually photographed or scanned – and are available online worldwide. For research purposes, this is already a step forward.

However, there is still a challenge to overcome: using text recognition software to convert the digitized old fonts into a modern form that is readable by non-specialists as well as by computers. Scientists at the Center for Philology and Digitality at Julius-Maximilians-Universität Würzburg (JMU) in Bavaria, Germany, have made a significant contribution to further development in this field.

Page from a French version of the “Narrenschiff”. Such old fonts can be reliably converted into computer-readable text with OCR4all. (Image: Staats- und Universitätsbibliothek Dresden, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/deed.de)

With OCR4all, the JMU research team is making a new tool available to the scientific community. It converts digitized historical prints into computer-readable texts with an error rate of less than one percent. And it offers a graphical user interface that requires no IT expertise. Previous tools of this kind were not always user-friendly, as users mostly had to work with programming commands.

[Clip]

The new OCR4all tool was developed under the direction of Christian Reul, together with his computer science colleagues Professor Frank Puppe (Chair of Artificial Intelligence and Applied Computer Science) and Christoph Wick, Digital Humanities expert Uwe Springmann, and numerous students and assistants.

[Clip]

Christian Reul explains the challenges involved in the development of OCR4all: Automatic text recognition (OCR = Optical Character Recognition) has been working very well for modern fonts for some time now. However, this has not yet been the case for historical fonts.

“One of the biggest problems was typography,” says Reul. One of the reasons for this is that the first printers of the 15th century did not use uniform fonts. “Their printing stamps were all carved by themselves, each printing house practically had its own letters.”

Error rates below one percent

Whether e or c, whether v or r – it is often not easy to distinguish in old prints, but software can learn to recognize such subtleties. To do so, it has to be trained on sample material. In his work, Reul has developed methods to make training more efficient. In a case study with six historical prints from the years 1476 to 1572, the average error rate in automatic text recognition was reduced from 3.9 to 1.7 percent.
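To make the idea of an "error rate" concrete: OCR quality is commonly reported as a character error rate (CER), that is, the edit distance between the recognized text and a hand-corrected reference, divided by the length of the reference. The announcement does not say exactly how the OCR4all figures were computed, so the following is only a minimal sketch under that assumption. The function names and sample strings are hypothetical and are not part of OCR4all.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(recognized: str, reference: str) -> float:
    """Edit distance relative to the reference length (assumed CER definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(recognized, reference) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped."""
    return (before - after) / before


# Hypothetical sample: 'e' misread as 'c' and 'v' misread as 'r'.
reference = "even the very old letters"
recognized = "cven the rery old letters"

print(f"CER: {character_error_rate(recognized, reference):.1%}")
print(f"Relative reduction: {relative_reduction(3.9, 1.7):.1%}")
```

Under this definition, the case-study improvement from 3.9 to 1.7 percent corresponds to a relative reduction of roughly 56 percent in recognition errors.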

Resources

Full Text of Announcement, Links to Additional Resources

Direct to OCR4all on GitHub

Filed under: Academic Libraries, Libraries, News, Patrons and Users, Preservation

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at Ask.com, and is currently a contributing editor at Search Engine Land.

© 2022 Library Journal. All rights reserved.

