A Study of Rights Protection Path of Generative Artificial Intelligence Training Data

doi:10.3785/j.issn.1008-942X.CN33-6000/C.2024.01.181


Abstract To further unlock the productive potential of data, China proposed a framework of foundational data rules in December 2022. The framework is built on property rights and comprises systems for data circulation, revenue distribution, and security. However, the technical logic of data collection, machine learning, and content generation in the development of large language models is inconsistent with the logic underlying those foundational data rules. Because of this contradiction, the current legislative framework cannot resolve many disputes over data rights and interests. Moreover, the unique attributes required of training data make the legal disputes it triggers more complicated and more worthy of research.

To counter the legal risks accompanying this cutting-edge technology, China has promulgated the “Interim Administrative Measures for Generative AI Services” and is actively pursuing AI legislation. To ensure that the technology develops in a way consistent with human safety and morality, it is essential to establish a governance framework that incorporates specific rules protecting the rights and interests involved in training data. Risk management strategies and other measures must also be added to build a stable and sustainable technological ecosystem. First, incentive theory offers one justification: granting developers ownership of the rights and interests in training data can motivate them to take effective measures to protect that data. This can help address the shortage of native training data for large-scale model development worldwide, thereby easing the logical conflict between training data development and the foundational data rules. Second, under Locke’s labour theory of property, guaranteeing ownership amounts to recognizing the value of model developers' labor. The way the value of training data is ascertained is consistent with the way legislation confirms the value of civil property rights.

In constructing this new type of right, the subject should be the developer who trains a model on the training data using particular algorithms. The object of the right should be training data that satisfies the requirements of legitimacy and originality. On the one hand, the content of the right should encompass the four powers of traditional property rights: holding, using, benefiting, and disposing. On the other hand, the regulatory framework for this new type of right differs from that of traditional property rights because training data is non-exclusive, scalable, and iterative, and it relies heavily on technological enforcement. Once the fundamental framework has been established, the right should be coordinated with multiple interests, including public interests, personality rights, and patents, to build internal consistency within the legal system and maximize external social utility.

Ultimately, on the basis of the established right, it is necessary to construct a domestic open-source data standard system for large-scale models that protects the legitimate rights and interests of all participants and prevents the monopolization and misuse of training data in open-source sharing.

Key words: generative artificial intelligence; training data; data rights; property rights; LLMs

Received: 18 January 2024


Cite this article:

Wei Bin, Pan Zhenghao. A Study of Rights Protection Path of Generative Artificial Intelligence Training Data[J]. JOURNAL OF ZHEJIANG UNIVERSITY, 2026, 56(4): 5-17.

URL:

https://www.zjujournals.com/soc/EN/10.3785/j.issn.1008-942X.CN33-6000/C.2024.01.181 OR https://www.zjujournals.com/soc/EN/Y2026/V56/I4/5
