Rissanen Data Analysis: Examining Dataset Characteristics via Description Length - Zhuanzhi Papers


Minima · Datasets · Data analysis · Estimation/estimators · Bias

March 5, 2021

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length


Ethan Perez, Douwe Kiela, Kyunghyun Cho

From arXiv. Code is available at https://github.com/ethanjperez/rda, along with a script to run RDA on your own dataset.

We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
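To make the comparison in the abstract concrete, here is a minimal sketch of estimating the labels' description length with and without access to a given input feature. It assumes a prequential (online, block-wise) code as the MDL estimator and a Laplace-smoothed count model, which the abstract does not specify. The toy dataset, the function names, and the block sizes are all hypothetical and are not taken from the authors' repository.

```python
import math
import random
from collections import Counter, defaultdict


def fit_counts(examples, key_fn):
    """Count how often each label occurs for each context key."""
    counts = defaultdict(Counter)
    for x, y in examples:
        counts[key_fn(x)][y] += 1
    return counts


def codelength_bits(counts, examples, key_fn, num_labels):
    """Bits needed to encode the labels under Laplace-smoothed frequencies."""
    total = 0.0
    for x, y in examples:
        c = counts.get(key_fn(x), Counter())
        p = (c[y] + 1) / (sum(c.values()) + num_labels)
        total -= math.log2(p)
    return total


def prequential_mdl(data, key_fn, num_labels, block_ends):
    """Block-wise prequential code length of the labels, in bits.

    The first block is sent with a uniform code. Every later block is
    encoded by a model fitted on all examples that precede it.
    """
    start = block_ends[0]
    total = start * math.log2(num_labels)
    for end in block_ends[1:]:
        counts = fit_counts(data[:start], key_fn)
        total += codelength_bits(counts, data[start:end], key_fn, num_labels)
        start = end
    return total


# Hypothetical toy data: inputs are bit pairs (a, b); the label copies b
# and is flipped with probability 0.1.
rng = random.Random(0)
data = []
for _ in range(512):
    a, b = rng.randint(0, 1), rng.randint(0, 1)
    y = b if rng.random() > 0.1 else 1 - b
    data.append(((a, b), y))

block_ends = [8, 16, 32, 64, 128, 256, 512]
with_b = prequential_mdl(data, lambda x: x[1], 2, block_ends)
without_b = prequential_mdl(data, lambda x: (), 2, block_ends)
print(f"MDL with feature b:    {with_b:.1f} bits")
print(f"MDL without feature b: {without_b:.1f} bits")
```

Under these assumptions, a shorter description length with the feature than without it is read as evidence that the feature helps model the labels. In practice the count model would be replaced by a learned model such as a neural network, but the comparison has the same shape.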

