The paper linked to below was recently posted on arXiv.
From the Abstract
We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of Scientific Research papers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.
From the Introduction
This is a corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals representing a large scale, cross-discipline set of research data to support NLP and ML research.
Research into the application of NLP and Machine Learning to scholarly content has attracted considerable attention in recent years. However, progress has been held back because of limited availability of large, cross-discipline datasets. Through releasing this dataset we hope to aid the research community in their work to expand the understanding of commonalities and differences between processing of scientific text and text of a different nature (e.g. news text). Moreover, this dataset allows research on challenges in processing scientific text that do not exist for other types of data.
In this article we want to report on details of the dataset being released, focusing on the structure of the data, coverage of features and tools which can be used with it.
Direct to Full Text Article (v. 2)
Direct to Dataset Info and Download