Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models such as Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019). As synthetic content comes to dominate, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025, using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though modest given those systems' smaller scale. Observed fluctuations reflect irreducible linguistic diversity, variable corpus sizes across years, and finite sampling error; after public LLM adoption, similarity rises exponentially. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.
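The year-wise similarity metric described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: it assumes each year's sampled articles have already been encoded into dense vectors by a Transformer encoder (e.g., a sentence-embedding model), and computes the mean cosine similarity over all unordered pairs; the toy vectors stand in for real embeddings.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of row vectors."""
    # L2-normalise each row so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = embeddings.shape[0]
    # Average the strict upper triangle, excluding self-similarity (diagonal).
    return float(sims[np.triu_indices(n, k=1)].mean())

# In the study's setting, `year_vectors` would hold Transformer embeddings
# of one year's article sample; here we use toy 2-D vectors for illustration.
year_vectors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]])
score = mean_pairwise_cosine(year_vectors)
```

Tracking this score per year (2013-2025) yields the trend the abstract describes: roughly flat or slowly rising similarity before public LLM adoption, then a sharp increase afterward.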