The rapid advancement of general-purpose AI models has heightened concerns about copyright infringement in training data, yet current regulatory frameworks remain predominantly reactive rather than proactive. This paper examines the regulatory landscape of AI training data governance across major jurisdictions, including the EU, the United States, and the Asia-Pacific region, and identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development. Our analysis of major cases further reveals critical gaps in pre-training data filtering. Existing solutions such as transparency tools, perceptual hashing, and access control mechanisms each address only specific aspects of the problem and cannot prevent the initial copyright violation. We identify two fundamental challenges: pre-training license collection and content filtering, for which comprehensive copyright management is infeasible at scale, and verification, which currently lacks tools to confirm that filtering actually prevented infringement. We propose a multilayered filtering pipeline that combines access control, content verification, machine learning classifiers, and continuous database cross-referencing to shift copyright protection from post-training detection to pre-training prevention. This approach offers a pathway toward protecting creator rights while enabling continued AI innovation.
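
To make the layered structure of the proposed pipeline concrete, the following is a minimal Python sketch under stated assumptions; it is not the paper's implementation. Every name in it (`Document`, `ALLOWED_LICENSES`, `screen`, `cross_reference`, the `opted_out` flag) is hypothetical. An exact SHA-256 fingerprint stands in for content verification, and a simple copyright-notice regex stands in for a trained machine learning classifier.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Document:
    url: str
    text: str
    license: Optional[str] = None  # None when no license could be collected
    opted_out: bool = False        # hypothetical machine-readable opt-out flag


# Assumption: the set of licenses a trainer has decided to accept.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT"}

# Placeholder for a trained classifier: flags explicit copyright notices.
NOTICE = re.compile(r"(©|\(c\)|copyright)\s*\d{4}|all rights reserved", re.IGNORECASE)


def fingerprint(text: str) -> str:
    """SHA-256 over lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def access_control(doc: Document, registry: Set[str]) -> Optional[str]:
    """Layer 1: honor opt-outs and require an accepted license."""
    if doc.opted_out:
        return "opted out"
    if doc.license not in ALLOWED_LICENSES:
        return "license missing or not allowed"
    return None


def content_verification(doc: Document, registry: Set[str]) -> Optional[str]:
    """Layer 2: reject text whose fingerprint is in the registry of protected works."""
    if fingerprint(doc.text) in registry:
        return "matches registered work"
    return None


def classifier(doc: Document, registry: Set[str]) -> Optional[str]:
    """Layer 3: stand-in for an ML classifier of likely protected content."""
    if NOTICE.search(doc.text):
        return "copyright notice detected"
    return None


LAYERS = (access_control, content_verification, classifier)


def screen(doc: Document, registry: Set[str]) -> Optional[str]:
    """Return the first rejection reason, or None if the document passes every layer."""
    for layer in LAYERS:
        reason = layer(doc, registry)
        if reason is not None:
            return f"{layer.__name__}: {reason}"
    return None


def cross_reference(accepted: List[Document], registry: Set[str]) -> List[Document]:
    """Layer 4: re-check already accepted documents against a refreshed registry."""
    return [d for d in accepted if fingerprint(d.text) not in registry]
```

In this sketch the layers run from cheapest to most expensive, so a document rejected on licensing grounds never reaches the later checks. `cross_reference` is meant to be re-run whenever the registry is updated, which models the continuous cross-referencing step. The sketch does not address the verification challenge raised above, namely confirming that filtering actually prevented infringement.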


