Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) make it possible to generate synthetic participant data by prompting a model to answer items while impersonating individuals with specific demographic profiles, potentially allowing in silico piloting before real data collection. Across four preregistered studies (N ≈ 300 each), we tested whether LLM-simulated datasets can reproduce the latent structures and measurement properties of human responses. In Studies 1-2, we compared LLM-generated data with real datasets for two validated scales; in Studies 3-4, we created new scales by running exploratory factor analysis (EFA) on simulated data and then examined whether the resulting structures generalized to newly collected human samples. Simulated datasets replicated the intended factor structures in three of four studies and showed consistent configural and metric invariance, with scalar invariance achieved for the two newly developed scales. Simulated datasets also showed full internal invariance across gender. However, correlation-based tests revealed substantial differences between real and synthetic datasets, and notable discrepancies appeared in score distributions and variances. Thus, while LLMs capture group-level latent structures, they do not approximate individual-level data properties. Overall, LLM-generated data appear useful for early-stage, group-level psychometric prototyping, but not as substitutes for individual-level validation. We discuss methodological limitations, risks of bias and data pollution, and ethical considerations related to in silico psychometric simulations.
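The contrast drawn above, in which group-level structure is preserved while variances differ, can be illustrated with a minimal descriptive sketch. This is not the preregistered analysis: the function name `compare_item_structure`, the two summary statistics, and the toy data are all assumptions made for illustration, and the formal correlation-matrix and invariance tests are not reproduced here.

```python
import numpy as np

def compare_item_structure(real, synthetic):
    """Descriptively compare two (respondents x items) response matrices.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    if real.shape[1] != synthetic.shape[1]:
        raise ValueError("both datasets must have the same number of items")

    # Inter-item Pearson correlation matrices (items are columns).
    r_real = np.corrcoef(real, rowvar=False)
    r_syn = np.corrcoef(synthetic, rowvar=False)

    # Compare only the unique off-diagonal correlations.
    upper = np.triu_indices(real.shape[1], k=1)
    diff = r_real[upper] - r_syn[upper]

    return {
        # Root-mean-square difference between the two correlation structures.
        "corr_rmse": float(np.sqrt(np.mean(diff ** 2))),
        # Per-item ratio of synthetic to real sample variance.
        "variance_ratio": synthetic.var(axis=0, ddof=1) / real.var(axis=0, ddof=1),
    }

# Toy data only (not from the studies): one common factor plus item noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(300, 1))
real = factor + rng.normal(size=(300, 5))

# A linear rescaling keeps every inter-item correlation but shrinks the spread.
synthetic = 0.5 * real + 3.0

result = compare_item_structure(real, synthetic)
print(result["corr_rmse"], result["variance_ratio"])
```

In this toy case the correlation structure is identical while every item variance is a quarter of the original, which is one simple way a dataset can match latent structure yet fail to reproduce score distributions. Items with zero variance in either dataset would yield undefined correlations and would need separate handling.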