Abstract To further unlock the productive potential of data, China proposed a framework of foundational data rules in December 2022. The framework is based on property rights and comprises systems for data circulation, revenue distribution, and security. However, the technical logic of data collection, machine learning, and content generation in the development of large language models is inconsistent with the logic of those foundational data rules. Because of this contradiction, the current legislative framework cannot resolve many disputes over data rights and interests. Moreover, the unique attributes required of training data make the legal disputes it triggers more complicated and more worthy of research.
To counteract the legal risks that accompany this cutting-edge technology, China has promulgated the “Interim Administrative Measures for Generative AI Services” and is actively pursuing AI legislation. To ensure that the technology develops in a manner consistent with human safety and morality, it is essential to establish a governance framework that incorporates specific regulations protecting the rights and interests involved in training data. Risk management strategies and other measures must also be added to build a stable and sustainable technological ecosystem. First, incentive theory offers a justification: granting developers ownership of the rights and interests in training data motivates them to take effective measures to protect that data. This can help address the shortage of native training data for large-scale model development worldwide, thereby easing the logical conflict between training data development and the foundational data rules. Second, under Locke’s labour theory of property, guaranteeing ownership recognizes the value of model developers’ labour. The way the value of training data is ascertained is consistent with the way legislation confirms the value of civil property rights.
In constructing this new type of right, the subject should be the developer who applies particular algorithms to train a model on the training data. The object of the right should be training data that satisfies the requirements of legitimacy and originality. On the one hand, the content of the right should encompass the four powers of traditional property rights: holding, using, benefiting, and disposing. On the other hand, the regulatory framework for this new type of right differs from that of traditional property rights because training data is non-exclusive, scalable, and iterative, and because the right relies heavily on technological enforcement. Once the fundamental framework has been established, the right should be reconciled with multiple other interests, including public interests, personality rights, and patents, to build internal consistency within the legal system and maximize external social utility.
Finally, building on the established right, it is necessary to construct a domestic open-source data standard system for large-scale models that protects the legitimate rights and interests of all participants and prevents the monopolization and misuse of training data in open-source sharing.
Published: 20 March 2026