On November 26th, the Library of Congress published an RFI (request for information) regarding web harvesting.
From the Synopsis:
The Library of Congress, Office of Strategic Initiatives (OSI) is seeking information from potential contractors about how best to design a requirement related to saving and reviewing information from the Internet. The Library is seeking information, e.g., current/existing commercial solutions, design solutions, etc., on how to best meet this web harvesting requirement.
This RFI is to determine if potential offerors can meet the Library’s technical and production requirements for harvesting web content and to receive feedback on pricing models and reasonable quality assurance. The Library is actively seeking suggested solutions and alternatives that will meet our requirements.
From the RFI:
Many of the activities of the digital lifecycle for harvested web content occur at the Library of Congress, including seed URL nomination, permissions gathering, scoping and preparation of a seed list, quality review, and public access to researchers. The Library’s web harvesting curator tools and infrastructure have been developed for the inputs and outputs of open source tools (Heritrix for harvesting, and Wayback Machine for access). The potential requirements described here are to support the Library’s large-scale, ongoing harvesting efforts, plus storage for the life of any potential contract, indexing for access, restricted access to the content for processing by Library staff, and transfer to the Library for long-term storage.
Although the following provides a general description of the Library’s potential requirements, the Library is actively seeking suggested alternatives to the requirements discussed below, where appropriate.
Direct to Complete RFI (22 pages; PDF)
Full text is also embedded below.
See Also: Web Archiving In the United States: A 2013 Survey
October 2013. 25 pages; PDF.
Published by the National Digital Stewardship Alliance.