As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
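To make the three indicators concrete, the following is a minimal Python sketch, not the HypoSpace implementation. It assumes one plausible reading of the definitions above: Validity is the fraction of sampled proposals that lie in the admissible set, Uniqueness is the fraction of samples that are distinct, and Recovery is the fraction of the admissible set hit by at least one sample. The function name `score_proposals`, the `canonical` parameter, and the toy edge-set hypotheses are all hypothetical; the exact formulas used in the suite may differ.

```python
from typing import Callable, Dict, Hashable, Sequence, Set


def score_proposals(
    proposals: Sequence[Hashable],
    admissible: Set[Hashable],
    canonical: Callable[[Hashable], Hashable] = lambda h: h,
) -> Dict[str, float]:
    """Score a batch of sampled hypotheses against an enumerated admissible set.

    Assumed definitions (illustrative, not taken from the HypoSpace code):
      validity   = valid samples / all samples
      uniqueness = distinct samples / all samples
      recovery   = distinct valid samples / size of admissible set
    """
    if not proposals or not admissible:
        raise ValueError("proposals and admissible must both be non-empty")

    # Map every hypothesis to a canonical key so equivalent forms compare equal.
    keys = [canonical(h) for h in proposals]
    admissible_keys = {canonical(h) for h in admissible}

    valid = [k for k in keys if k in admissible_keys]
    n = len(keys)
    return {
        "validity": len(valid) / n,
        "uniqueness": len(set(keys)) / n,
        "recovery": len(set(valid)) / len(admissible_keys),
    }


# Toy example: hypotheses are causal graphs encoded as frozensets of directed edges.
g1 = frozenset({("A", "B")})
g2 = frozenset({("B", "A")})
g3 = frozenset({("A", "C"), ("C", "B")})
g4 = frozenset({("A", "B"), ("A", "C")})
not_admissible = frozenset({("C", "A")})

admissible_set = {g1, g2, g3, g4}
samples = [g1, g1, g2, not_admissible]  # one repeat, one invalid proposal

scores = score_proposals(samples, admissible_set)
print(scores)
```

In this toy run, three of four samples are valid and three of four are distinct, yet only two of the four admissible graphs are found. A sampler that keeps repeating one valid graph would hold Validity at 1.0 while Uniqueness and Recovery shrink, which is the mode-collapse pattern the abstract describes.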