The following article appears in the January/February 2016 issue of D-Lib Magazine.
Authors
Justin F. Brunelle
The MITRE Corporation and Old Dominion University
Krista Ferrante and Eliot Wilczek
The MITRE Corporation
Michele C. Weigle and Michael L. Nelson
Old Dominion University
Source
D-Lib Magazine
Vol 22, No. 1-2 (January/February 2016)
Abstract
In this work, we present a case study in which we investigate using open-source, web-scale web archiving tools (i.e., Heritrix and the Wayback Machine installed on the MITRE Intranet) to automatically archive a corporate Intranet. We use this case study to outline the challenges of Intranet web archiving, identify situations in which the open-source tools are not well suited for the needs of the corporate archivists, and make recommendations for future corporate archivists wishing to use such tools. We performed a crawl of 143,268 URIs (125 GB and 25 hours) to demonstrate that the crawlers are easy to set up, efficiently crawl the Intranet, and improve archive management. However, challenges exist when the Intranet contains sensitive information, areas with potential archival value require user credentials, or archival targets make extensive use of internally developed and customized web services. We elaborate on and recommend approaches for overcoming these challenges.