For the first time, the public can download full datasets of the National Archives Catalog archival descriptions and authority records, as well as the entirety of the 1940 census. This free service gives researchers access through the Amazon Web Services (AWS) Registry of Open Data.
Until now, this data was available through the Catalog and the 1940 census websites, but not in bulk. This release aligns with the National Archives’ effort to Make Access Happen for the records in its care. This is the first time the National Archives is releasing a census dataset in full.
The Catalog dataset contains 225 gigabytes of data. It includes archival descriptions at the record group/collection, series, file unit, and item levels, as well as the URLs for over 127 million digital copies and data from citizen archivist contributions.
The National Archives intends to update the dataset on the Registry of Open Data regularly.
The 1940 census dataset contains the entirety of the digitized 1940 census, 15 terabytes of data in all: the metadata index and 3.7 million images of the population schedules, the enumeration district maps, and the enumeration district descriptions.
The tools available through AWS will allow researchers to review, for example, specific sections of the census, like records of one state or county. Previously, that task would have required reviewing individual images in the Catalog or using technical knowledge to query and download the data and images, with limits on how much data could be queried at once.
The National Archives Catalog dataset, over 225 gigabytes of data, includes the archival descriptions and authority records from the National Archives Catalog (as of November 20, 2020), including the URLs for over 127 million digital copies and data from citizen archivist contributions. We plan to update the dataset on the Registry of Open Data regularly.
In addition to the Registry of Open Data entries for these datasets, NARA has published detailed documentation to guide users on how to access both the full datasets and specific subsets of the data. Users can access both datasets using their respective Amazon Resource Names (ARNs), which uniquely identify resources on AWS so that users can locate each dataset, or with the AWS Command Line Interface (CLI), an open-source tool that lets users interact with AWS services from the command line. For the Catalog dataset, we have also provided zip files for easy download of the full dataset.
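To illustrate how an ARN leads to a specific subset of the data, here is a minimal Python sketch using only the standard library. It assumes each dataset lives in a publicly listable Amazon S3 bucket and that the ARN has the usual S3 form `arn:aws:s3:::bucket-name`. The bucket name `example-nara-bucket`, the prefix `some/prefix/`, and the helper names are hypothetical placeholders, not values from the NARA documentation; the real ARNs and folder layout are given in the Registry of Open Data entries.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in Amazon S3 listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN such as arn:aws:s3:::name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Drop any object path that follows the bucket name.
    return parts[5].split("/", 1)[0]


def list_url(bucket: str, prefix: str = "", max_keys: int = 100) -> str:
    """Build an anonymous S3 ListObjectsV2 URL restricted to one prefix."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_text: str) -> list:
    """Return (key, size in bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(arn: str, prefix: str = "") -> list:
    """Fetch one page of object keys under a prefix (requires network access)."""
    url = list_url(bucket_from_arn(arn), prefix)
    with urllib.request.urlopen(url) as response:
        return parse_keys(response.read().decode("utf-8"))
```

Calling `list_objects("arn:aws:s3:::example-nara-bucket", "some/prefix/")` would return one page of file names and sizes under that prefix, assuming the bucket permits anonymous listing. The sketch does not handle pagination or regional redirects; the AWS CLI covers both and is the better choice for bulk downloads.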