SUBSCRIBE
SUBSCRIBE
EXPLORE +
  • About infoDOCKET
  • Academic Libraries on LJ
  • Research on LJ
  • News on LJ
  • Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar
  • Libraries
    • Academic Libraries
    • Government Libraries
    • National Libraries
    • Public Libraries
  • Companies (Publishers/Vendors)
    • EBSCO
    • Elsevier
    • Ex Libris
    • Frontiers
    • Gale
    • PLOS
    • Scholastic
  • New Resources
    • Dashboards
    • Data Files
    • Digital Collections
    • Digital Preservation
    • Interactive Tools
    • Maps
    • Other
    • Podcasts
    • Productivity
  • New Research
    • Conference Presentations
    • Journal Articles
    • Lecture
    • New Issue
    • Reports
  • Topics
    • Archives & Special Collections
    • Associations & Organizations
    • Awards
    • Funding
    • Interviews
    • Jobs
    • Management & Leadership
    • News
    • Patrons & Users
    • Preservation
    • Profiles
    • Publishing
    • Roundup
    • Scholarly Communications
      • Open Access

August 5, 2020 by Gary Price

Dataset: Elsevier Releases OA CC-By Corpus For NLP and AI Research

August 5, 2020 by Gary Price

The paper linked to below was recently posted on arXiv.

Title

Elsevier OA CC-By Corpus

Authors

Daniel Kershaw
Elsevier

Rob Koeling
Elsevier

Source

via arXiv

From the Abstract

We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of Scientific Researchpapers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.

From the Intoduction

This is a corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals representing a large scale, cross-discipline set of research data to support NLP and ML research.

Research into the application of NLP and Machine Learning toscholarly content has attracted considerable attentionin recent years.  However, progress has been held back because of limited availability of large, cross-discipline datasets. Through releasing this dataset we hope to aid the research community in their work to expand the understanding ofcommonalities and differences between processing of scientific text and text of a different nature (e.g. news text). Moreover, this dataset allows research on challenges in processing scientific text that do not exist for other types ofdata.

In this article we want to report on details of the dataset being released, focusing on the structure of the data, coverageof features and tools which can be used with it.

Direct to Full Text Article (v. 2)

Direct to Dataset Info and Download

See Also: Leveraging Machine Learning to Fuel New Discoveries with the arXiv Dataset (via arXiv)

Filed under: Data Files, Elsevier, Journal Articles, News, Open Access

SHARE:

About Gary Price

Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne St. University Library and Information Science Program. From 2006-2009 he was Director of Online Information Services at Ask.com.

ADVERTISEMENT

Archives

Job Zone

ADVERTISEMENT

Related Infodocket Posts

ADVERTISEMENT

FOLLOW US ON X

Tweets by infoDOCKET

ADVERTISEMENT

This coverage is free for all visitors. Your support makes this possible.

This coverage is free for all visitors. Your support makes this possible.

Primary Sidebar

  • News
  • Reviews+
  • Technology
  • Programs+
  • Design
  • Leadership
  • People
  • COVID-19
  • Advocacy
  • Opinion
  • INFOdocket
  • Job Zone

Reviews+

  • Booklists
  • Prepub Alert
  • Book Pulse
  • Media
  • Readers' Advisory
  • Self-Published Books
  • Review Submissions
  • Review for LJ

Awards

  • Library of the Year
  • Librarian of the Year
  • Movers & Shakers 2022
  • Paralibrarian of the Year
  • Best Small Library
  • Marketer of the Year
  • All Awards Guidelines
  • Community Impact Prize

Resources

  • LJ Index/Star Libraries
  • Research
  • White Papers / Case Studies

Events & PD

  • Online Courses
  • In-Person Events
  • Virtual Events
  • Webcasts
  • About Us
  • Contact Us
  • Advertise
  • Subscribe
  • Media Inquiries
  • Newsletter Sign Up
  • Submit Features/News
  • Data Privacy
  • Terms of Use
  • Terms of Sale
  • FAQs
  • Careers at MSI


© 2026 Library Journal. All rights reserved.


© 2022 Library Journal. All rights reserved.