Gori publishes a report on “Legal Protection Debt in ML training datasets”

Postdoctoral researcher Gori has conducted research on the practices of ML dataset creation, curation and dissemination. The report emphasises how such practices play a crucial role in determining the level of legal protection enjoyed by the legal subjects located downstream of ML pipelines. The report illustrates how some structural features of ML pipelines can give rise to the problem of “many hands” and to the accumulation of various forms of “technical debt”. The report argues that, lacking appropriate safeguards, a “Legal Protection Debt” can incrementally build up along the different stages of the pipeline. The report therefore stresses the need for actors involved in ML pipelines to adopt a forward-looking approach to legal compliance. This requires overcoming a siloed and modulated understanding of legal liability and paying keen attention to the potential use cases of datasets.

The report illustrates how the requirements set out by data protection law can help overcome such challenges, while at the same time facilitating compliance with the standards set by the Open Data and Open Science framework. The report contains an Annex that analyses the special regime established by the GDPR for the processing of personal data performed for scientific research purposes.

The report was developed in the context of the HumanE AI Net project and follows from the author's involvement in a microproject on “Collection of datasets tailored for HumanE-AI multimodal perception and modelling”. The HumanE AI Net project is funded under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 952026). HumanE AI Net aims to facilitate AI systems that enhance human capabilities and empower individuals and society as a whole while respecting human autonomy and self-determination.

Click here for the PDF of the report.
