Big Data: Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers
From the Yahoo Labs Announcement:
Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.
The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.
Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research.
The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use.
In addition to the interaction data, we are providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the relevant local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.
About Gary Price
Gary Price (firstname.lastname@example.org) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com. Gary is also the co-founder of infoDJ, an innovation research consultancy supporting corporate product and business model teams with just-in-time fact and insight finding.