Dataset: Elsevier Releases OA CC-By Corpus For NLP and AI Research

August 5, 2020 by Gary Price

The paper linked to below was recently posted on arXiv.

Title

Elsevier OA CC-By Corpus

Authors

Daniel Kershaw
Elsevier

Rob Koeling
Elsevier

Source

via arXiv

From the Abstract

We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of Scientific Research papers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.

From the Introduction

This is a corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals representing a large scale, cross-discipline set of research data to support NLP and ML research.

Research into the application of NLP and Machine Learning to scholarly content has attracted considerable attention in recent years. However, progress has been held back because of limited availability of large, cross-discipline datasets. Through releasing this dataset we hope to aid the research community in their work to expand the understanding of commonalities and differences between processing of scientific text and text of a different nature (e.g. news text). Moreover, this dataset allows research on challenges in processing scientific text that do not exist for other types of data.

In this article we want to report on details of the dataset being released, focusing on the structure of the data, coverage of features and tools which can be used with it.

Direct to Full Text Article (v. 2)

Direct to Dataset Info and Download

Filed under: Data Files, Elsevier, Journal Articles, News, Open Access

About Gary Price

Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne St. University Library and Information Science Program. From 2006-2009 he was Director of Online Information Services at Ask.com.
