AP: “AI Chatbots Need More Books to Learn From. These Libraries are Opening Their Stacks”
From the Associated Press:
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston’s public library.
[Clip]
“It is a prudent decision to start with public domain data because that’s less controversial right now than content that’s still under copyright,” said Burton Davis, a deputy general counsel at Microsoft.
[Clip]
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
[Clip]
Harvard’s newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter’s handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
[Clip]
Harvard’s new AI training collection has an estimated 242 billion tokens, an amount that’s hard for humans to fathom but it’s still just a drop of what’s being fed into the most advanced AI systems.
Learn More, Read the Complete Article (about 1130 words)
Direct to Research Article (preprint): Institutional Books 1.0: A 242B Token Dataset From Harvard Library’s Collections, Refined For Accuracy and Usability (via arXiv)
Direct to Dataset/Info (via HuggingFace)
Note: To learn more about the project discussed in the AP article see this infoDOCKET post from December 12, 2024: A New Research Initiative From the Harvard Law School Library: The Institutional Data Initiative (IDI) Launches Today
See Also: Another Project of Possible Interest
- The Public Interest Corpus Update – Boston Edition (via Authors Alliance; March 26, 2025)
- Developing a Public-Interest Training Commons of Books (December 24, 2024)
- Public Interest Corpus Website/Info
About Gary Price
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.

