Datasheets for Datasets

Description

Datasheets for datasets establish comprehensive documentation standards for datasets, systematically recording creation methodology, data composition, collection procedures, preprocessing transformations, intended applications, potential biases, privacy considerations, and maintenance protocols. These structured documents enhance dataset transparency by providing essential context for appropriate usage, enabling informed decisions about dataset suitability for specific tasks, supporting bias detection and mitigation efforts, ensuring compliance with data protection regulations, and promoting responsible data stewardship throughout the entire data lifecycle from collection to disposal.

Example Use Cases

Privacy

Documenting a medical imaging dataset with detailed information about patient privacy protections, anonymisation procedures, and data sharing constraints to ensure sensitive health information is handled appropriately and regulatory compliance is maintained.

Fairness

Creating comprehensive datasheets for recruitment datasets that document demographic representation across different job categories, helping developers identify potential bias in training data and develop more equitable hiring algorithms.

Transparency

Establishing transparent documentation for financial transaction datasets that clearly describes data collection methodology, preprocessing steps, and intended use cases, enabling researchers to make informed decisions about dataset appropriateness for their specific applications.

Safety

Documenting a safety-critical autonomous vehicle perception dataset with records of sensor calibration procedures, environmental conditions during collection, edge-case scenario coverage, and known distributional gaps, enabling safety engineers to assess whether training data adequately represents the operating conditions required for safe deployment and to support regulatory type-approval submissions.

Reliability

Documenting the statistical distributions, temporal coverage, and known quality characteristics of training datasets to support model monitoring pipelines, enabling operations teams to detect data drift and assess whether production data remains within the documented training distribution for informed decisions about model retraining or recalibration.

Limitations

Creating thorough datasheets requires significant time investment and domain expertise to properly document collection methods, biases, and ethical considerations, potentially delaying dataset release or publication.
Information may become outdated as datasets undergo preprocessing, cleaning, or augmentation, requiring ongoing maintenance to ensure documentation accuracy throughout the data lifecycle.
Absence of standardised templates and enforcement mechanisms leads to inconsistent documentation quality and completeness across different organisations and research communities.
Dataset creators may intentionally omit sensitive information about collection methods, participant consent, or potential biases to avoid legal liability or competitive disadvantage.
Limited adoption and awareness means many existing datasets lack proper documentation, creating gaps in the historical record and making legacy dataset assessment difficult.

Resources

Research Papers

Datasheets for Datasets

Timnit Gebru et al.•Mar 23, 2018

The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on. Datasheets for datasets will facilitate better communication between dataset creators and dataset consumers, and encourage the machine learning community to prioritize transparency and accountability.

Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research

Ramtin Zargari Marandi, Anne Svane Frahm, and Maja Milojevic•Jan 23, 2025

Despite progresses in data engineering, there are areas with limited consistencies across data validation and documentation procedures causing confusions and technical problems in research involving machine learning. There have been progresses by introducing frameworks like "Datasheets for Datasets", however there are areas for improvements to prepare datasets, ready for ML pipelines. Here, we extend the framework to "Datasheets for AI and medical datasets - DAIMS." Our publicly available solution, DAIMS, provides a checklist including data standardization requirements, a software tool to assist the process of the data preparation, an extended form for data documentation and pose research questions, a table as data dictionary, and a flowchart to suggest ML analyses to address the research questions. The checklist consists of 24 common data standardization requirements, where the tool checks and validate a subset of them. In addition, we provided a flowchart mapping research questions to suggested ML methods. DAIMS can serve as a reference for standardizing datasets and a roadmap for researchers aiming to apply effective ML techniques in their medical research endeavors. DAIMS is available on GitHub and as an online app to automate key aspects of dataset evaluation, facilitating efficient preparation of datasets for ML studies.

MT-Adapted Datasheets for Datasets: Template and Repository

Marta R. Costa-jussà et al.•May 27, 2020

In this report we are taking the standardized model proposed by Gebru et al. (2018) for documenting the popular machine translation datasets of the EuroParl (Koehn, 2005) and News-Commentary (Barrault et al., 2019). Within this documentation process, we have adapted the original datasheet to the particular case of data consumers within the Machine Translation area. We are also proposing a repository for collecting the adapted datasheets in this research area

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Amy K. Heger et al.•Jun 6, 2022

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from datasheets for datasets (Gebru, 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Despite the fact that data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks.

Related Techniques

Name	Description	Assurance Goals
Model Cards	Model cards are standardised documentation frameworks that systematically document machine learning models through structured templates. The templates cover intended use cases, performance metrics across different demographic groups and operating conditions, training data characteristics, evaluation procedures, limitations, and ethical considerations. They serve as comprehensive technical specifications that enable informed model selection, prevent inappropriate deployment, support regulatory compliance, and facilitate fair assessment by providing transparent reporting of model capabilities and constraints across diverse populations and scenarios.	General
Data Version Control	Data Version Control (DVC) is a Git-like version control system specifically designed for machine learning data, models, and experiments. It tracks changes to large data files, maintains reproducible ML pipelines, and creates a complete audit trail of data transformations, model training, and evaluation processes. DVC works alongside Git to provide end-to-end lineage tracking from raw data through preprocessing, training, and deployment, enabling teams to reproduce any model version and understand exactly how datasets evolved throughout the ML lifecycle.	General
Model Development Audit Trails	Model development audit trails create comprehensive, immutable records of all decisions, experiments, and changes throughout the ML lifecycle. This technique captures dataset versions, training configurations, hyperparameter searches, model architecture changes, evaluation results, and deployment decisions in tamper-evident logs. Audit trails enable reproducibility, accountability, and regulatory compliance by documenting who made what changes when and why, supporting post-hoc investigation of model failures or unexpected behaviours.	General

Datasheets for Datasets

Description

Example Use Cases

Privacy

Fairness

Transparency

Safety

Reliability

Limitations

Resources

Research Papers

Datasheets for Datasets

Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research

MT-Adapted Datasheets for Datasets: Template and Repository

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Related Techniques

Tags