We study one of the most popular problems in **neurosymbolic learning** (NSL): learning neural classifiers given only the result of applying a symbolic component $\sigma$ to the gold labels of the elements of a vector $\mathbf{x}$. The gold labels of the elements of $\mathbf{x}$ are unknown to the learner. We make multiple contributions, theoretical and practical, to a problem that has not been studied in this context so far: characterizing and mitigating *learning imbalances*, i.e., large differences in the errors incurred when classifying instances of different classes (also known as **class-specific risks**). Our theoretical analysis reveals a unique phenomenon: $\sigma$ itself can greatly impact learning imbalances. This result contrasts sharply with prior research on supervised and weakly supervised learning, which studies learning imbalances only under data imbalances. On the practical side, we introduce a technique for estimating the marginal distribution of the hidden gold labels from weakly supervised data. We then introduce algorithms that mitigate imbalances at training and testing time by treating this marginal as a constraint. We validate the effectiveness of our techniques against strong baselines from NSL and long-tailed learning, observing performance improvements of up to 14%.
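To make the setup concrete, here is a minimal illustrative sketch, not the paper's actual algorithm. It assumes a toy NSL task in which $\sigma$ sums two hidden digit labels (as in the classic MNIST-addition task), estimates the hidden-label marginal by matching the empirical distribution of observed sums, and applies the estimated marginal as a test-time prior correction in the spirit of logit adjustment. The function names (`estimate_marginal`, `prior_corrected_predict`) and the particular choice of $\sigma$ are assumptions made for this example.

```python
# Illustrative sketch only: a toy instance of the setup above, where
# sigma(y1, y2) = y1 + y2 over two hidden digit labels. We fit the hidden-label
# marginal to the observed sigma-outputs, then use it as a test-time prior.
# All names and the choice of sigma are assumptions, not the paper's method.
import numpy as np

K = 10  # number of classes (digits 0-9) in this toy setup


def sum_distribution(q):
    """Distribution over sigma(y1, y2) = y1 + y2 when y1, y2 ~ q i.i.d."""
    return np.convolve(q, q)  # length 2K-1, one entry per sum 0..2K-2


def estimate_marginal(observed_sums, steps=2000, lr=0.5):
    """Estimate the hidden-label marginal q from observed symbolic outputs.

    Parameterizes q = softmax(theta) and minimizes the cross-entropy between
    the empirical sum frequencies and the sum distribution induced by q.
    """
    emp = np.bincount(observed_sums, minlength=2 * K - 1).astype(float)
    emp /= emp.sum()
    theta = np.zeros(K)
    for _ in range(steps):
        q = np.exp(theta - theta.max())
        q /= q.sum()
        p = np.maximum(sum_distribution(q), 1e-12)
        # Chain rule: dL/dp for L = -sum_s emp[s] log p[s], then back to q,
        # using dp[s]/dq[j] = 2 q[s-j] for the self-convolution.
        dLdp = -emp / p
        dLdq = 2.0 * np.correlate(dLdp, q, mode="valid")
        grad = q * (dLdq - q @ dLdq)  # softmax Jacobian-vector product
        theta -= lr * grad
    return q


def prior_corrected_predict(logits, q_est):
    """Test-time mitigation: shift classifier logits by the log of the
    estimated marginal (a logit-adjustment-style correction)."""
    return np.argmax(logits + np.log(np.maximum(q_est, 1e-12)), axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_q = rng.dirichlet(np.ones(K))            # an imbalanced hidden marginal
    pairs = rng.choice(K, size=(20000, 2), p=true_q)
    q_hat = estimate_marginal(pairs.sum(axis=1))  # only sums are observed
    print("max marginal error:", np.abs(q_hat - true_q).max())
```

The sketch highlights the two practical ingredients the abstract names: the hidden-label marginal is recoverable from $\sigma$-outputs alone under suitable assumptions (here, that pairs are drawn i.i.d. from the marginal), and once estimated it can steer predictions without retraining the classifier.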