May 24, 2022

GPO Publishes Update About Web Harvesting Pilot Project

The March 2013 issue of the Federal Depository Library Program (FDLP) newsletter includes an update about the Government Printing Office (GPO) web harvesting pilot.

From the Article:

In late 2011, Library Services and Content Management (LSCM) and OAM staff developed a pilot project to test an implementation of the Internet Archive’s Heritrix-based Archive-It, which is a subscription-based Web harvesting and archiving service. In developing the pilot project, the project team networked with Web harvesting teams from the Library of Congress, the National Archives and Records Administration, and the University of North Texas Library (a GPO library partner already well-known for establishing the CyberCemetery and its leadership in digital preservation initiatives).

While each of these GPO partners and more than 228 libraries and agencies had proven the basic concept and viability of Heritrix and Archive-It, the Web Harvesting Task Force was charged with determining whether Archive-It would work within LSCM’s operational budget and staffing parameters.

Test crawls were conducted on ten test Web sites, and the resulting facsimile harvested copies were reviewed for performance. MARC records were created in the CGP by performing a crosswalk from the Archive-It Dublin Core metadata to MARC. Links in the CGP MARC records were created to the archived content on the Internet Archive’s Wayback server for each harvested Web site.
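As an illustration only, the crosswalk step described above can be sketched in Python. The article does not publish the actual mapping the team used, so the table below is an assumption, loosely modeled on commonly cited Dublin Core to MARC equivalents; the function names, the sample record, and the example.org URL are all hypothetical.

```python
# Hypothetical Dublin Core -> MARC mapping: element -> (MARC tag, subfield code).
# Illustrative only; this is NOT the crosswalk used in the GPO pilot.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
}


def crosswalk(dc_record):
    """Convert a dict of Dublin Core elements into (tag, code, value) tuples."""
    fields = []
    for element, values in dc_record.items():
        target = DC_TO_MARC.get(element)
        if target is None:
            continue  # no mapping defined for this element; skip it
        tag, code = target
        if isinstance(values, str):
            values = [values]  # treat a single value as a one-item list
        for value in values:
            fields.append((tag, code, value))
    # Stable sort by tag so repeated fields keep their original order.
    return sorted(fields, key=lambda field: field[0])


def add_archive_link(fields, archive_url):
    """Append an 856 $u field pointing at the archived copy of the site."""
    return fields + [("856", "u", archive_url)]


if __name__ == "__main__":
    record = {
        "title": "Example Agency Web Site",
        "subject": ["Water", "Energy"],
        "format": "text/html",  # unmapped in this sketch, so it is dropped
    }
    # Placeholder URL; a real record would point at the archived capture.
    marc = add_archive_link(crosswalk(record), "https://wayback.example.org/example-agency/")
    for tag, code, value in marc:
        print(f"{tag} ${code} {value}")
```

A lookup table keeps the mapping separate from the conversion logic, so elements can be added or remapped without touching the code that builds the fields.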

Having successfully achieved the proof of concept, Laurie Hall, LSCM’s Director of Library Technical Information Services, charged the Task Force to:

  • Form a Web Archiving Team and develop a project plan toward a full implementation of a Web harvesting and archiving service.
  • Develop modifications needed to LSCM workflow for acquisition, cataloging, classification, archiving, and access, to include whole Web sites as well as individual publications.
  • Develop configurations on cost and staff resources for continuation and expansion of the project, including a budget for FY2013.

Read the Complete Article and Meet the Web Harvesting Pilot Team

See Also: Slide Presentation About Web Harvest Pilot From Depository Library Conference (October 18, 2012; PDF)

Learn More About Archive-It (from the Internet Archive)

About Gary Price

Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was a Director of Online Information Services, and he is currently a contributing editor at Search Engine Land.