Professor Ted Underwood has received a $73,122 grant from the National Endowment for the Humanities to investigate the consequences of error in digital libraries. While digital libraries represent an immense storehouse of knowledge, their texts are full of errors because optical transcription, the process used to convert page images into text, is imperfect.
“It isn’t unusual for five percent of the words in volumes to be mistranscribed, with the level of error much higher in some volumes,” said Underwood. “Simply measuring the fraction of mistranscribed words is easy. It’s harder to know how much difference those errors make for the methods and questions that actually interest researchers. Some forms of analysis are undisturbed by high levels of error; others may be quite sensitive, especially when errors are distributed unevenly across different historical periods and genres.”
Underwood will work with graduate students from the iSchool and English Department to construct parallel collections that pair each “clean” text with a realistically error-ridden version of the same book drawn from a digital library. The team will build collections of Chinese texts as well as English texts ranging from 1700 to the present, because different character sets and printing technologies produce different kinds of error. Then the team will apply a wide range of data-mining methods to both the clean and error-ridden collections and measure the distortion produced by transcription error and other common sources of noise. The project will provide tools that help other researchers estimate the level of uncertainty in their own conclusions.
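The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual tooling: the two sample sentences are invented, the function names are hypothetical, the error rate is approximated by aligning tokens with the standard library's `difflib.SequenceMatcher`, and "distortion" is stood in for by the cosine similarity of word-frequency counts, which is just one of many measures a researcher might choose.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def token_error_rate(clean_tokens, noisy_tokens):
    """Fraction of clean tokens with no aligned match in the noisy version."""
    matcher = SequenceMatcher(a=clean_tokens, b=noisy_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(clean_tokens)


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-frequency Counters."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b)


# Invented example pair: the "noisy" text mimics a long-s misreading (s -> f).
clean_text = "the ship sailed past the silent shore"
noisy_text = "the fhip failed paft the filent fhore"

clean_tokens = tokenize(clean_text)
noisy_tokens = tokenize(noisy_text)

# Step 1: the "easy" measurement, the share of mistranscribed words.
error_rate = token_error_rate(clean_tokens, noisy_tokens)

# Step 2: how much a downstream measure (here, word frequencies) is distorted.
similarity = cosine_similarity(Counter(clean_tokens), Counter(noisy_tokens))

print(f"token error rate: {error_rate:.3f}")
print(f"word-frequency cosine similarity: {similarity:.3f}")
```

The two numbers answer different questions: the first counts errors, while the second indicates how far a particular analysis drifts because of them. A method that depends only on very common words might be barely affected by the same error rate that badly skews a method relying on rarer vocabulary.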
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area.
He earned his MLIS degree from Wayne State University in Detroit.
Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.