GeoDeepDive combines library science, computer science, and geoscience to dive into repositories of published text, tables, and figures and return valuable information.
Scientific publications contain measurements, descriptions, and images that have utility beyond the aims of the original work, particularly when they are aggregated into databases. For example, the Paleobiology Database contains field- and museum-based descriptions of more than 1.3 million fossil occurrences compiled from some 50,000 references, and sample-based geochemical data from the published literature are available in EarthChem. Both databases can be used to address fundamental scientific questions, but neither is complete. Plus, adding to existing literature-based data syntheses and constructing new ones is difficult and can be prohibitively time-consuming.
The primary goal of our U.S. National Science Foundation EarthCube building block project, GeoDeepDive, is to facilitate the creation and augmentation of literature-derived databases and to leverage published knowledge and past investments in data acquisition. The project combines library science (the aggregation and curation of digital documents and bibliographic metadata), geoscience (the generation of research questions and labeling of terms in externally managed scientific ontologies), and computer science (the use of high-throughput computing infrastructure and machine reading systems to parse and extract data from millions of documents)