Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has the capacity to solve connectivity for graphs with diameter up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics and show that the learned strategy hinges on whether most training instances fall within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution, while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data to within a model's capacity leads both standard and disentangled Transformers to learn the exact algorithm rather than the degree-based heuristic.
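As an illustration of the algorithm the abstract attributes to the model, here is a minimal sketch (not from the paper) of connectivity via repeated cubing of the boolean reachability matrix: each cubing triples the reachable distance, so $L$ rounds cover all pairs within distance $3^L$. The function name and structure are our own assumptions.

```python
import numpy as np

def connectivity_by_cubing(adj: np.ndarray, num_layers: int) -> np.ndarray:
    """Sketch of the adjacency-matrix-power algorithm described in the abstract.

    adj: boolean adjacency matrix of an undirected graph.
    Returns a boolean matrix R with R[i, j] True iff dist(i, j) <= 3**num_layers.
    """
    n = len(adj)
    # Add self-loops so matrix powers capture paths of length *at most* k.
    R = adj | np.eye(n, dtype=bool)
    for _ in range(num_layers):
        # Cubing the reachability matrix triples the covered distance.
        R = np.linalg.matrix_power(R.astype(int), 3) > 0
    return R
```

On a path graph with 5 nodes, one round ($3^1 = 3$) connects the endpoints of the 4-node sub-path but not the full path (distance 4), while two rounds ($3^2 = 9$) connect everything, matching the $3^L$ capacity bound.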