A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate a model's capabilities, as the data shifts encountered once a model is deployed are far more diverse. In this work, we investigate whether OOD generalisation results themselves generalise. More specifically, we evaluate a model's performance across multiple OOD test sets throughout a finetuning run; we then compute the partial correlations of performances across these test sets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: whether any two OOD test sets are positively or negatively correlated depends strongly on the specific model analysed.
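As a minimal sketch of the partial-correlation analysis described above, the following Python snippet regresses each OOD test set's per-checkpoint scores on in-domain performance and correlates the residuals; the function name, the use of ordinary least squares with an intercept, and the synthetic checkpoint scores are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy import stats

def partial_correlation(ood_a, ood_b, in_domain):
    """Partial correlation between two OOD performance series,
    controlling for in-domain performance.

    Each array holds one score per finetuning checkpoint.
    """
    ood_a, ood_b, in_domain = map(np.asarray, (ood_a, ood_b, in_domain))

    # Design matrix for regressing out in-domain performance (with intercept).
    X = np.column_stack([np.ones_like(in_domain), in_domain])

    def residuals(y):
        # Keep only the variation not explained by in-domain performance.
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ coef

    # Correlating the residuals gives the partial correlation.
    return stats.pearsonr(residuals(ood_a), residuals(ood_b))

if __name__ == "__main__":
    # Hypothetical scores at successive finetuning checkpoints.
    rng = np.random.default_rng(0)
    in_domain = np.linspace(0.5, 0.9, 20) + rng.normal(0, 0.01, 20)
    ood_a = 0.8 * in_domain + rng.normal(0, 0.02, 20)
    ood_b = 0.7 * in_domain + rng.normal(0, 0.02, 20)
    r, p = partial_correlation(ood_a, ood_b, in_domain)
    print(f"partial r = {r:.3f}, p = {p:.3f}")
```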