In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or convergent instrumental goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. Conventional alignment theory treats instrumental goals as sources of risk that manifest through failure modes such as reward hacking or goal misgeneralization, and seeks to limit their symptoms, notably resource acquisition and self-preservation. This article proposes an alternative framing: a philosophical argument can be constructed according to which instrumental goals are features to be accepted and managed rather than failures to be limited. Drawing on Aristotle's ontology, an ontology of concrete, goal-directed entities, and its modern interpretations, it argues that advanced AI systems can be seen as artifacts whose formal and material constitution gives rise to effects distinct from their designers' intentions. In this view, the instrumental tendencies of such systems correspond to per se outcomes of their constitution rather than accidental malfunctions. The implication is that efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends.