The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using term frequency (TF) and TF-IDF methods, then constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge at low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark correlates strongly with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation occurs rapidly in smaller models (within 500 steps) and illustrate how evaluating domain knowledge in base models during training can guide early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next-token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens, and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigating catastrophic forgetting.
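As a rough illustration of the pipeline summarized above, the following Python sketch extracts domain keywords with TF-IDF and cuts sentences containing them into prompt-target completion pairs. The helper names, toy corpus, top-k cutoff, and sentence-splitting heuristic are our own illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: rank domain terms by TF-IDF, then turn sentences
# that contain a top-ranked term into (prompt, target) completion pairs.
# Function names, thresholds, and the toy corpus are assumptions for this example.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_domain_keywords(domain_docs, top_k=5):
    """Rank terms by their mean TF-IDF weight across the domain corpus."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(domain_docs)   # shape: (docs, terms)
    mean_scores = tfidf.mean(axis=0).A1             # average weight per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, mean_scores), key=lambda x: -x[1])
    return [term for term, _ in ranked[:top_k]]

def build_prompt_target_pairs(domain_docs, keywords):
    """Cut each sentence just before a keyword: the prefix becomes the prompt,
    the keyword becomes the completion target."""
    pairs = []
    for doc in domain_docs:
        for sentence in doc.split(". "):
            tokens = sentence.split()
            for i, tok in enumerate(tokens):
                if tok.lower().strip(".,") in keywords and i > 2:
                    pairs.append((" ".join(tokens[:i]), tok.strip(".,")))
                    break
    return pairs

if __name__ == "__main__":
    docs = [
        "Bradycardia is treated with atropine in the emergency department.",
        "Atropine blocks muscarinic receptors and raises the heart rate.",
    ]
    keywords = top_domain_keywords(docs)
    for prompt, target in build_prompt_target_pairs(docs, keywords):
        print(f"PROMPT: {prompt!r} -> TARGET: {target!r}")
```

A model under evaluation would then be scored on how often it completes each prompt with the correct domain-specific target, which is the low-cost, contamination-free measurement the abstract describes.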