Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of 'naturalness' vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of 'at-issueness' to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue that serves as a prompt into subparts, (ii) generates continuations for these subparts using LMs, (iii) recombines the dialogue with the generated continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, and this effect is enhanced in instruct-tuned models. LMs also reduce their at-issue preference when relevant cues (e.g., "Hey, wait a minute") are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.
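To make the four DGRC steps concrete, the following is a minimal sketch, not the paper's implementation: it assumes a HuggingFace causal LM (here "gpt2") and a toy dialogue whose at-issue and not-at-issue subparts are hand-written for illustration; names such as `dialogue`, `at_issue_part`, and `not_at_issue_part` are hypothetical.

```python
# Minimal DGRC sketch: divide a dialogue into subparts, generate continuations,
# recombine them with the full dialogue, and compare likelihoods.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Sum of token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean negative log-likelihood per predicted token; undo the mean.
    return -out.loss.item() * (ids.size(1) - 1)

def generate_continuation(prompt: str) -> str:
    """Greedy continuation of `prompt` (sampling settings are illustrative)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.size(1):], skip_special_tokens=True)

# (i) Divide: a dialogue used as a prompt, with its subparts singled out.
dialogue = "A: My neighbor, who just moved in, plays the trumpet at night.\nB:"
at_issue_part = "A: My neighbor plays the trumpet at night.\nB:"      # at-issue content
not_at_issue_part = "A: My neighbor just moved in.\nB:"               # not-at-issue content

# (ii) Generate: continuations for each subpart.
cont_at_issue = generate_continuation(at_issue_part)
cont_not_at_issue = generate_continuation(not_at_issue_part)

# (iii) Recombine: attach each continuation to the full original dialogue.
recombined_at_issue = dialogue + cont_at_issue
recombined_not_at_issue = dialogue + cont_not_at_issue

# (iv) Compare: which recombined sequence does the model find more likely?
print("at-issue continuation    :", log_likelihood(recombined_at_issue))
print("not-at-issue continuation:", log_likelihood(recombined_not_at_issue))
```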