The performance and reliability of Artificial Intelligence (AI) models are closely linked to the quality and diversity of the data used in their training. While the technical aspects of model development often take centre stage, the underlying methods for sourcing and assembling training data are equally relevant. Different data collection mechanisms bring distinct advantages and challenges, not only for AI developers seeking robust and more representative models, but also for individuals whose data may be included. In practice, AI developers often employ multiple data collection mechanisms concurrently to build comprehensive training datasets. Understanding these mechanisms is essential for advancing trustworthy AI systems and for addressing privacy and data governance considerations in the development process.
Source: OECD
Accordingly, this paper maps the principal mechanisms currently used to source data for training AI systems and proposes a taxonomy of them. The taxonomy aims to provide a basis for future analysis of the privacy and data governance implications of each mechanism.
The taxonomy organises these key data collection mechanisms into the following structure:
1. Data collected directly from individuals and organisations
• Provided and observed data: A growing volume of training data originates from data submitted by individuals or passively collected during their interactions with AI systems, particularly in business-to-consumer (B2C) settings such as chatbots, virtual assistants, and automated helpdesks. Additionally, some AI developers, such as social media platforms, may leverage data provided or observed from individuals across their broader portfolio to support AI model training.
• Voluntary data donations: Although still emerging, voluntary data contributions from individuals or organisations offer the potential to enrich training datasets with diverse, real-world information that may otherwise be difficult to access.
2. Data collected from third-party providers
• Commercial data licensing: Data licensing agreements with organisations offer another avenue for AI developers to access datasets. Data marketplaces and data brokers play a relevant role as data intermediaries in this ecosystem, offering access to a wide variety of third-party data.
• Non-commercial practices: AI developers may also obtain datasets through non-commercial means. Open data initiatives, encompassing both public and private sector data released under open licenses, are key sources for the development of AI models. Dataset publishers, who curate and organise datasets from various sources and make them freely and openly available, are significant contributors in this context. Data scraping has also emerged as a widely adopted data collection mechanism, driven by the need for large and diverse datasets to support AI training.
By developing this taxonomy, the paper offers policymakers and stakeholders a structured approach for policy discussions on privacy, data governance, and trustworthy AI development. It underscores the complexity and variety of the data collection mechanisms that AI developers rely on, and notes that emerging approaches involving secure processing environments and tools such as Privacy-Enhancing Technologies (PETs) offer ways to improve the usability of these mechanisms while safeguarding privacy and other rights and interests, such as intellectual property. The taxonomy lays the groundwork for further analysis of how to balance the growing demand for AI training data (in terms of volume and variety) with privacy and data governance considerations such as data quality and traceability.
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area.
He earned his MLIS degree from Wayne State University in Detroit.
Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.