To handle underspecified or ambiguous queries, AI assistants need a policy for managing their uncertainty to determine (a) when to guess the user intent and answer directly, (b) when to enumerate and answer multiple possible intents, and (c) when to ask a clarifying question. However, the best policy depends on context, such as user preferences or the interaction modality. For example, enumerating multiple possible user intents is cumbersome on small screens or in a voice setting. In this work, we propose to train steerable policies for managing this uncertainty using self-play. Given two agents, one simulating a user and the other an AI assistant, we generate conversations where the user issues a potentially ambiguous query and the assistant must determine how to respond. Importantly, the model takes as input the numerical cost of each clarification question and of each generated word, and is asked to take the action that maximizes its final reward, the cost-penalized accuracy. We use Reinforced Self-Training (ReST) to train our model to achieve high reward and show that this yields a steerable policy that predictably changes its behavior conditioned on the provided costs, achieving higher reward and accuracy. Moreover, the learned policy generalizes to numerical cost values that were unobserved at training time.
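One way to read the cost-penalized reward described above is the following minimal formalization; the symbols $c_q$, $c_w$, $n_q$, $n_w$ and the linear-penalty form are assumptions for illustration, not necessarily the exact definition used in the paper:
\[
R \;=\; \mathrm{Acc}(\text{final answer}) \;-\; c_q \cdot n_q \;-\; c_w \cdot n_w,
\]
where $n_q$ is the number of clarifying questions asked, $n_w$ the number of generated words, and $c_q$, $c_w$ the per-question and per-word costs provided to the model as input, so that changing $(c_q, c_w)$ steers the policy toward guessing, enumerating, or clarifying.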