This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce novel or unseen play styles without retraining. Experiments in a custom multiplayer Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.
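The abstract does not give the exact reward formula, so the following is only a minimal sketch of one plausible reading of the distance-reduction reward, assuming Euclidean distance in the behavior space and normalization by the previous distance; the function and variable names (`behavior_reward`, `prev_behavior`, etc.) are illustrative, not taken from the paper.

```python
import numpy as np

def behavior_reward(prev_behavior, curr_behavior, target_behavior, eps=1e-8):
    """Reward the agent for moving its running behavior statistics toward the target.

    All vectors live in the same N-dimensional behavior space (e.g. aggressiveness,
    mobility, cooperativeness). The reward is positive when the step reduces the
    distance to the target and negative when it increases it. Dividing by the
    previous distance is an assumed normalization; the paper's exact scheme is
    not specified in the abstract.
    """
    prev_dist = np.linalg.norm(target_behavior - prev_behavior)
    curr_dist = np.linalg.norm(target_behavior - curr_behavior)
    return (prev_dist - curr_dist) / (prev_dist + eps)

# Example with a 3-D behavior space: the target vector is sampled uniformly
# (e.g. once per episode) from a region covering plausible human styles, and
# the policy observes both the current and the target behavior vectors.
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=3)
prev = np.array([0.20, 0.50, 0.10])
curr = np.array([0.30, 0.55, 0.15])
print(behavior_reward(prev, curr, target))
```

In such a setup, the observation fed to the PPO policy would simply concatenate the agent's game state with its current and target behavior vectors, which is what lets a single policy generalize to unseen targets without retraining.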