Data search and discoverability are high on the wish list of scientists and U.S. government agencies alike. Scientists want to learn what’s already been done, connect with other researchers, and reuse code when possible. Research agencies, for their part, are under legislative pressure to connect with data users: the Foundations for Evidence-Based Policymaking Act of 2018 requires them to solicit feedback from users on the utility and accessibility of their data. The problem, however, is that most U.S. agencies have no idea who is using their data or for what purpose. Even for important public policy problems that depend on statistical data or on data gathered in administering major programs such as Medicaid, income taxes, the Supplemental Nutrition Assistance Program (SNAP), or veterans’ health, agencies can only track how many hits they receive on the websites where they have made their public data sets available.
We can do much better. What we need are automated search tools that help agencies see who is using their data in research, help researchers discover how others are using these data for research similar to their own, and, in the course of making these community connections, help spark new research and better information to tackle priority social and economic issues.
Fortunately, development of these tools is well underway. One ongoing effort, Show US the Data, has progressed to a major, multi-agency pilot project and is already showing early results. It is the brainchild of a group of collaborators led by New York University Professor Julia Lane, former Federal Chief Information Officer Suzette Kent, and me, and also includes the Texas Advanced Computing Center at the University of Texas at Austin, the global publishing and analytics company Elsevier, and the Institute for Data Intensive Engineering and Science at Johns Hopkins University.
Because most scientific research is reported in journal articles, Lane and her partners have set out to use machine learning and natural language processing to conduct rich text analysis on a corpus of more than 84 million publications managed by Elsevier, with the goal of finding citations of specific data sets. With backing from the Overdeck Family Foundation, Schmidt Futures, and CHORUS, Show US the Data ran a Kaggle competition to develop algorithms that would be up to the task. More than 1,600 data science teams worldwide competed. The winning algorithms were unveiled at a conference in October 2021 and have been incorporated into a pilot project in which six federal agencies are participating: the Economic Research Service and the National Agricultural Statistics Service at the U.S. Department of Agriculture, the National Center for Science and Engineering Statistics at the National Science Foundation, the National Center for Education Statistics at the U.S. Department of Education, the National Oceanic and Atmospheric Administration at the U.S. Department of Commerce, and NASA.
See Also: Show US the Data Conference Recording