Dataset: Elsevier Releases OA CC-BY Corpus For NLP and AI Research
The paper linked to below was recently posted on arXiv.
From the Abstract
We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of Scientific Research papers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.
From the Introduction
This is a corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals representing a large scale, cross-discipline set of research data to support NLP and ML research.
Research into the application of NLP and Machine Learning to scholarly content has attracted considerable attention in recent years. However, progress has been held back because of limited availability of large, cross-discipline datasets. Through releasing this dataset we hope to aid the research community in their work to expand the understanding of commonalities and differences between processing of scientific text and text of a different nature (e.g. news text). Moreover, this dataset allows research on challenges in processing scientific text that do not exist for other types of data.
In this article we want to report on details of the dataset being released, focusing on the structure of the data, coverage of features and tools which can be used with it.
Direct to Full Text Article (v. 2)
Direct to Dataset Info and Download
About Gary Price
Gary Price (firstname.lastname@example.org) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.