May 16, 2022

Pilot Project Intro: “Show US the Data”

From Harvard Data Science Review:

Data search and discoverability is high on the wish list of scientists and U.S. government agencies alike. Scientists want to learn what’s already been done, connect with other researchers, and reuse code when possible. And research agencies are under legislative pressure to connect with data users: the Foundations of Evidence-Based Policymaking Act of 2018 requires agencies to solicit feedback from users on the utility and accessibility of their data. The problem, however, is that most U.S. agencies have no idea who is using their data or for what purpose. Even for important public policy problems that need statistical data or data gathered as part of administering major programs such as Medicaid, income taxes, the Supplemental Nutrition Assistance Program (SNAP), or veterans’ health, agencies can only track how many hits they receive on the websites where they have made their public data sets available.

We can do much better. What is needed are automated search tools that help agencies see who is using their data in research, help researchers discover how others are using these data for research similar to their own, and in the course of making these community connections, help spark new research and better information to tackle priority social and economic issues.

Fortunately, there is good news. Development of these tools is well underway. One ongoing effort, Show US the Data, has progressed to a major, multi-agency pilot project and is already showing early results. It is the brainchild of a group of collaborators led by New York University Professor Julia Lane, former Federal Chief Information Officer Suzette Kent, and me, and also includes the Texas Advanced Computing Center at the University of Texas at Austin, the global publishing and analytics company Elsevier, and the Institute for Data Intensive Engineering and Science at Johns Hopkins University. Because most scientific research is reported in journal articles, Lane and her partners have set out to use machine learning and natural language processing to conduct rich text analysis on a corpus of over 84 million publications under the management of publisher Elsevier to find citations of specific data sets. With backing from the Overdeck Family Foundation, Schmidt Futures, and CHORUS, Show US the Data ran a Kaggle competition to develop algorithms that would be up to the task. Over 1,600 data science teams worldwide competed. The winning algorithms were unveiled at a conference in October 2021 and have been incorporated into a pilot project in which six federal agencies are participating: the Economic Research Service and National Agricultural Statistics Service at the U.S. Department of Agriculture, the National Center for Science and Engineering Statistics at the National Science Foundation, the National Center for Education Statistics at the U.S. Department of Education, the National Oceanic and Atmospheric Administration at the U.S. Department of Commerce, and NASA.

Learn More, Read the Complete Article, Listen to Podcasts (approx. 2050 words)

See Also: Show US the Data Conference, October 20, 2021 (Final Report)

See Also: On a Mission to Make Federal Data Sets More Useful and Accessible (via SPARC)

See Also: Data Inventories for the Modern Age? Using Data Science to Open Government Data (via Coleridge Initiative)

See Also: Show US the Data Conference Recording

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C., metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors of ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.