How does an archivist understand the relationship among billions of documents or search for a single record in a sea of data? With the proliferation of digital records, the task of the archivist has grown more complex. This problem is especially acute for the National Archives and Records Administration (NARA), the government agency responsible for managing and preserving the nation’s historical records.
At the end of President George W. Bush’s administration in 2009, NARA received roughly 35 times as much data as it had from the administration of President Bill Clinton, which was itself many times larger than the transfer from the previous administration. With the federal government increasingly using social media, cloud computing and other technologies in support of open government, this trend shows no sign of slowing. By 2014, NARA expects to accumulate more than 35 petabytes (quadrillions of bytes) of electronic records.
“The National Archives is a unique national institution that responds to requirements for preservation, access and the continued use of government records,” said Robert Chadduck, acting director for the National Archives Center for Advanced Systems and Technologies.
To find innovative and scalable approaches to managing large-scale electronic records collections, Chadduck turned to the Texas Advanced Computing Center (TACC), a National Science Foundation (NSF)-funded center for advanced computing research, to draw on the expertise of TACC’s digital archivist, Maria Esteva, and data analysis expert, Weijia Xu.
Archivists spend a significant amount of time determining the organization, contents and characteristics of collections so they can describe them for public access. “This process involves a set of standard practices and years of experience on the archivist’s side,” said Xu. “To accomplish this task in large-scale digital collections, we are developing technologies that combine computing power with domain expertise.”