Brewster Kahle Thanks Free and Open Source Communities For Help with the Digitization of 19th Century Newspapers (and Other Materials)
I have never been more encouraged by, or more thankful to, the Free and Open Source communities. Three months ago I posted a request for help with OCR’ing and processing 19th-century newspapers, and we got soooo many offers to help. Thank you, that was heartwarming and concretely helpful. Already, based on these suggestions, we are changing over our OCR and PDF software completely to FOSS, making big improvements, and building partnerships with FOSS developers in companies, universities, and as individuals that will propel the Internet Archive to have much better digitized texts. I am so grateful, thank you. So encouraging.
Several people suggested the German government-led initiative called OCR-D, which has made production-level tools for OCR and segmentation of complex and old materials, such as newspapers in the old German script Fraktur, or blackletter. (The Internet Archive had never been able to process these, and now we are doing it at scale.) We are also able to OCR more Indian languages, which is fantastic. This government project is FOSS and has money for outreach to make sure others use the tools; this is a step beyond most research grants.
Tesseract has made a major step forward in the last few years. When we last evaluated its accuracy, it was not as good as the proprietary OCR, but that has changed: we have done evaluations and it is just as good, and it can get better for our application because of its new architecture.
About Gary Price
Gary Price (firstname.lastname@example.org) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors of ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.