The March 2013 issue of the Federal Depository Library Program (FDLP) includes an update about the Government Printing Office (GPO) web harvesting pilot.
In late 2011, Library Services and Content Management (LSCM) and OAM staff developed a pilot project to test an implementation of Archive-It, the Internet Archive’s subscription-based Web harvesting and archiving service built on the Heritrix crawler. In developing the pilot, the project team consulted Web harvesting teams from the Library of Congress, the National Archives and Records Administration, and the University of North Texas Library (a GPO library partner already well known for establishing the CyberCemetery and for its leadership in digital preservation initiatives).
While each of these GPO partners and more than 228 libraries and agencies had proven the basic concept and viability of Heritrix and Archive-It, the Web Harvesting Task Force was charged with determining whether Archive-It would work within LSCM’s operational budget and staffing parameters.
Test crawls were conducted on ten test Web sites, and the resulting harvested facsimile copies were reviewed for performance. MARC records were created in the Catalog of U.S. Government Publications (CGP) by performing a crosswalk from the Archive-It Dublin Core metadata to MARC. For each harvested Web site, the CGP MARC record links to the archived content on the Internet Archive’s Wayback server.
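The article does not describe the crosswalk LSCM actually used. As a rough illustration only, a Dublin Core to MARC crosswalk can be sketched as a lookup table from Dublin Core elements to MARC tags and subfields. The mapping below loosely follows the Library of Congress unqualified Dublin Core to MARC crosswalk; the function names, the sample record, and the URLs are hypothetical and are not taken from the pilot.

```python
# Illustrative mapping from unqualified Dublin Core elements to MARC
# (tag, subfield) pairs. This is an assumption for demonstration, not the
# mapping LSCM used in the pilot.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
    "language": ("546", "a"),
    "rights": ("540", "a"),
    "identifier": ("856", "u"),
}


def crosswalk(dc_record):
    """Convert a dict of Dublin Core element -> list of values into a list
    of (tag, subfield, value) tuples ordered by MARC tag.

    Simplification: each value becomes its own field, so publisher and date
    are not merged into a single 260 field as a real catalog record would.
    Elements without a mapping are skipped.
    """
    fields = []
    for element, values in dc_record.items():
        target = DC_TO_MARC.get(element)
        if target is None:
            continue
        tag, code = target
        for value in values:
            fields.append((tag, code, value))
    # sorted() is stable, so values keep their input order within a tag.
    return sorted(fields, key=lambda field: field[0])


def add_archive_link(fields, archived_url):
    """Return a new field list with an 856 $u link to the archived copy."""
    return sorted(fields + [("856", "u", archived_url)],
                  key=lambda field: field[0])


if __name__ == "__main__":
    # Hypothetical metadata for one harvested site.
    sample = {
        "title": ["Sample Agency Web Site"],
        "publisher": ["Sample Agency"],
        "identifier": ["https://example.org/"],
        "coverage": ["unmapped in this sketch"],
    }
    record = add_archive_link(
        crosswalk(sample),
        "https://wayback.example.org/archived/sample",  # placeholder URL
    )
    for tag, code, value in record:
        print(f"{tag} ${code} {value}")
```

A table-driven mapping keeps the crosswalk easy to review and adjust, which matters when catalogers need to confirm each element-to-field decision.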
With the proof of concept successfully achieved, Laurie Hall, LSCM’s Director of Library Technical Information Services, charged the Task Force to:
- Form a Web Archiving Team and develop a project plan toward a full implementation of a Web harvesting and archiving service.
- Develop the modifications needed to LSCM workflows for acquisition, cataloging, classification, archiving, and access so that they cover whole Web sites as well as individual publications.
- Develop cost and staff resource configurations for the continuation and expansion of the project, including a budget for FY2013.
Learn More About Archive-It (from the Internet Archive)