January 26, 2022

Web Archiving: Cornell Selects Archive-It to Capture and Preserve 8 Million University Web Pages

From The Cornell Daily Sun:

Internet Archive will begin preserving Cornell’s online content starting this month after the University signed a contract with the Internet archiving company in March.

Internet Archive will create an archive of Cornell’s entire web space — approximately eight million documents — by capturing HTML coding, images, PDFs and links to external pages, according to Dean Krafft, Cornell library chief technology strategist, who is overseeing the project.

Cornell workers are beginning to use Internet Archive’s “Archive-It” function to make test scans, or “crawls,” of Cornell’s Internet domain, Krafft said. A complete crawl of the Cornell domain will occur two to three times a year, with the first one scheduled to take place within the next month, he said.


Cornell previously partnered with Archive-It in 2009 to provide nearly 80,000 free online books to the public, according to a press release by Cornell Libraries.


Kristine Hanna, Internet Archive’s director of archiving services, said that about 90 university libraries use the Archive-It service to collect and archive digital content.

Cornell’s archived web pages will be available publicly on Archive-It.org, giving people access to information that may no longer be available as a result of updates or removal of pages, Earle said.

We’ve been, and continue to be, major fans of the Archive-It service and the Internet Archive. Here are two reasons why.

1. As the article points out, the Cornell collection will be available on the web along with those from many other organizations (not only higher-ed).

2. Unlike the Wayback Machine (an essential tool also from Internet Archive), Archive-It collections are full-text searchable. Nice!

See Also: New: Japan Earthquake 2011 Web Archive
From the Internet Archive/Archive-It

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.