Yale and Columbia Economists are Developing Dataset to Show Immigration’s Impact on American Prosperity
From Yale News:
Yale and Columbia economists are building a massive dataset to better understand the role immigrants played in transforming the United States from its rural origins into a global economic power.
The researchers will merge individual-level data from the historical U.S. Census with the records of 11 million immigrants who arrived at the port of New York between 1820 and 1892; the passenger lists of 5.5 million immigrants who departed for the United States from the port of Hamburg, Germany, between 1850 and 1934; data from the Historical Census of Manufacturers on manufacturing employment and productivity at the county level from 1860 to 1929; and records from the U.S. Patent and Trademark Office covering a similar timespan.
The resulting dataset will provide the researchers with an unprecedented trove of evidence of immigration’s impact on American prosperity, said Costas Arkolakis, professor of economics and one of the project’s principal investigators.
[Clip]
Specifically, the dataset will help the researchers understand the extent to which the novel ideas and expertise immigrants brought to U.S. shores drove the nation’s emergence as an industrial and technological powerhouse, explained Michael Peters, assistant professor of economics at Yale. He and Sun Kyoung Lee, a Ph.D. candidate in economics at Columbia University, are the project’s other two principal investigators.
[Clip]
The researchers recently received a $1 million grant from the National Science Foundation to build the dataset.
[Clip]
Merging several massive databases and matching the records within them presents challenges, some straightforward and others more complex, the researchers note. For instance, the passenger lists were recorded in German and must be translated. Most of the datasets the researchers work with exist only on paper and must be converted from image to text. Even where datasets are machine-readable, the data are not harmonized. For example, in Steinway’s case, some sources list his occupation as “instrument maker” whereas others record it as “laborer.” Moreover, the records carry no time-invariant, individual-specific identifier, which makes it difficult to track the same person across sources.
Many of the techniques used to connect these large datasets rely on modern machine learning tools and were developed as part of Lee’s doctoral thesis.
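To make the matching problem concrete, here is a minimal sketch of linking records that share no common identifier. It is not the researchers' method, which the article describes only as relying on machine learning. It is a simple illustration that compares names by string similarity and requires birth years to be close. The field names, thresholds, and sample records are all invented for this example.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict, max_year_gap: int = 2) -> float:
    """Score a candidate pair; pairs with distant birth years score 0."""
    if abs(rec_a["birth_year"] - rec_b["birth_year"]) > max_year_gap:
        return 0.0
    return name_similarity(rec_a["name"], rec_b["name"])

def link_records(passengers: list, census: list, threshold: float = 0.85) -> list:
    """For each passenger, keep the best census candidate if it clears the threshold."""
    links = []
    for p in passengers:
        best = max(census, key=lambda c: match_score(p, c), default=None)
        if best is not None and match_score(p, best) >= threshold:
            links.append((p["id"], best["id"]))
    return links

# Invented sample records (hypothetical, for illustration only).
passengers = [
    {"id": "p1", "name": "Johann Schmidt", "birth_year": 1825},
    {"id": "p2", "name": "Anna Keller", "birth_year": 1830},
]
census = [
    {"id": "c1", "name": "Johan Schmidt", "birth_year": 1826},
    {"id": "c2", "name": "Maria Vogel", "birth_year": 1831},
]

links = link_records(passengers, census)
print(links)
```

In this toy data, the first passenger is linked to the census record with the slightly different spelling, while the second finds no sufficiently similar candidate. Real historical linkage has to cope with far messier input, such as translated names, transcription errors, and conflicting occupations, which is presumably where the more sophisticated tools come in.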
“We just have to keep finding creative ways to deal with all kinds of oddities in the data,” Lee said.
The dataset could eventually be expanded to include modern datasets containing post-1940 information, as well as data on economic activity in sectors other than manufacturing. Work on both is already in motion, according to the researchers.
Learn More, Read the Complete Article
See Also: RIDIR: A Big Data Approach to Understanding American Growth (via NSF)
Award Abstract #1831524.
About Gary Price
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C., metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.