From Yale News:
Yale and Columbia economists are building a massive dataset to better understand the role immigrants played in transforming the United States from its rural origins into a global economic power.
The researchers will merge individual-level data from the historical U.S. Census with the records of 11 million immigrants who arrived at the port of New York between 1820 and 1892; the passenger lists of 5.5 million immigrants who departed for the United States from the port of Hamburg, Germany, between 1850 and 1934; data from the Historical Census of Manufacturers on manufacturing employment and productivity at the county level from 1860 to 1929; and records from the U.S. Patent and Trademark Office covering a similar timespan.
The resulting dataset will provide the researchers with an unprecedented trove of evidence of immigration’s impact on American prosperity, said Costas Arkolakis, professor of economics and one of the project’s principal investigators.
Specifically, the dataset will help the researchers understand the extent to which the novel ideas and expertise immigrants brought to U.S. shores drove the nation’s emergence as an industrial and technological powerhouse, explained Michael Peters, assistant professor of economics at Yale. Peters and Sun Kyoung Lee, a Ph.D. candidate in economics at Columbia University, are the project’s other two principal investigators.
The researchers recently received a $1 million grant from the National Science Foundation to build the dataset.
Merging several massive databases and matching the records within them presents challenges, some straightforward and others more complex, the researchers note. For instance, the passenger lists were recorded in German and must be translated. Most of the datasets the researchers work with exist only on paper and must be converted from images to text. Even where datasets are machine-readable, the data are not harmonized. For example, in Steinway’s case, some sources list his occupation as “instrument maker” whereas others record it as “laborer.” Moreover, no source carries an individual identifier that stays constant over time, which makes it hard to track the same person from one dataset to the next.
Many of the techniques used to connect these large datasets rely on modern machine learning tools and were developed as part of Lee’s doctoral thesis.
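To make the matching problem concrete, the following is a minimal sketch of linking records without a shared identifier. It is not the researchers' method. It simply compares names with a string-similarity score from Python's standard library and requires birth years to be close. The record fields, the thresholds, and every name and value in the sample data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical fields; real archival records differ by source.
    name: str
    birth_year: int
    occupation: str


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_match(target, candidates, min_similarity=0.7, max_year_gap=2):
    """Return the most similar candidate, or None if no candidate qualifies.

    Occupation is deliberately not compared, because the same person may be
    recorded under different occupations in different sources.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        # Reported ages are often imprecise, so allow a small gap in birth year.
        if abs(cand.birth_year - target.birth_year) > max_year_gap:
            continue
        score = name_similarity(target.name, cand.name)
        if score >= min_similarity and score > best_score:
            best, best_score = cand, score
    return best


# Invented sample data: one passenger-list entry and three census entries.
passenger = Record("Johann Schmidt", 1825, "instrument maker")
census = [
    Record("John Schmidt", 1826, "laborer"),
    Record("Johanna Schmitz", 1840, "seamstress"),
    Record("Jon Smith", 1825, "farmer"),
]

match = best_match(passenger, census)
print(match)
```

In this sketch the passenger entry links to the census entry with the closest spelling and a birth year one year apart, even though the two occupations disagree. A real pipeline would need far more than this, for example blocking to avoid comparing every pair of records and a principled way to resolve ties.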
“We just have to keep finding creative ways to deal with all kinds of oddities in the data,” Lee said.
The dataset could eventually be expanded to include modern datasets containing post-1940 information, as well as data on economic activity in sectors other than manufacturing. That work is already underway, according to the researchers.
See Also: RIDIR: A Big Data Approach to Understanding American Growth (via NSF)
Award Abstract #1831524.