We introduce a method to determine whether a given capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the minimal program that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving us a theoretically grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability in a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, to uncovering dataset gender bias.
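The comparison described in the abstract (description length of the labels with and without a capability) can be illustrated with a small sketch. The abstract does not say how MDL is estimated, so this example assumes prequential (online) coding, a standard way to upper-bound description length: each label is coded with a model fit only on the examples sent before it. The function name `prequential_codelength`, the count-based model with add-one smoothing, and the toy data are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def prequential_codelength(inputs, labels, num_classes=2):
    """Bits needed to send the labels one at a time, each coded with a
    model fit only on the previously sent (input, label) pairs."""
    # Per-input label counts; this count table is the (assumed) model.
    counts = defaultdict(lambda: [0] * num_classes)
    total_bits = 0.0
    for x, y in zip(inputs, labels):
        seen = counts[x]
        # Add-one (Laplace) smoothed probability of the true label.
        p = (seen[y] + 1) / (sum(seen) + num_classes)
        total_bits += -math.log2(p)
        seen[y] += 1  # update the model only after the label is coded
    return total_bits


# Toy data (hypothetical): the label equals a single binary feature.
features = [i % 2 for i in range(64)]
labels = list(features)

# Without the capability: the model sees a constant input,
# so it can only learn overall label frequencies.
mdl_without = prequential_codelength([0] * len(labels), labels)

# With the capability: the model sees the feature.
mdl_with = prequential_codelength(features, labels)

print(f"without feature: {mdl_without:.1f} bits")
print(f"with feature:    {mdl_with:.1f} bits")
print("feature is useful" if mdl_with < mdl_without else "feature is not useful")
```

In this toy setting the feature determines the label, so the code length with the feature is much shorter than without it. Under the abstract's criterion, that is the signal that the capability is useful. A realistic analysis would swap the count table for a learned model and the single feature for the capability under study.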