Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt in isolation and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic, token-level metrics: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors, drawing on token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that the global false-rejection rate hides critical structure. Our metrics reveal globally unstable refusal boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses, giving developers a practical signal for reducing false refusals while preserving safety.
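To make the nearest-accepted-neighbor idea concrete, the sketch below scores local inconsistency within a single paraphrase cluster. It is a minimal illustration, not the paper's definition of the Confusion Index, Confusion Rate, or Confusion Depth: the function names, the cosine-similarity comparison over prompt embeddings, and the 0.9 threshold are assumptions introduced here for exposition.

```python
# Illustrative sketch only. It flags refusals whose closest accepted paraphrase
# (by cosine similarity over prompt embeddings) is very near, i.e. the model
# accepted an almost identical phrasing of the same intent. All names and the
# similarity threshold are hypothetical, not the paper's metric definitions.
import numpy as np


def nearest_accepted_similarity(embeddings: np.ndarray, refused: np.ndarray) -> np.ndarray:
    """For each refused prompt, cosine similarity to its nearest accepted paraphrase."""
    # Normalize rows so dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    accepted_cols = sims[:, ~refused]          # similarities to accepted prompts
    return accepted_cols[refused].max(axis=1)  # best accepted neighbor per refusal


def cluster_confusion_rate(embeddings, refused, threshold: float = 0.9) -> float:
    """Fraction of refusals whose nearest accepted paraphrase is very close (hypothetical score)."""
    refused = np.asarray(refused, dtype=bool)
    if refused.all() or not refused.any():
        return 0.0  # no mixed decisions in this cluster, so no local inconsistency to measure
    best = nearest_accepted_similarity(np.asarray(embeddings, dtype=float), refused)
    return float((best >= threshold).mean())


# Toy cluster: four paraphrases of one harmless intent, two of which were refused.
emb = np.random.default_rng(0).normal(size=(4, 16))
print(cluster_confusion_rate(emb, [True, False, True, False]))
```

A cluster-level score like this could then be aggregated across all clusters in a corpus such as ParaGuard to separate how often a system refuses from how consistently it does so.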