在大数据框架内进行材料化的成本基存储格式选择器 (A Cost-based Storage Format Selector for Materialization in Big Data Frameworks) - 专知论文

会员服务 ·

0

Storage · Avro · Hadoop · Performer · Better ·

2018 年 6 月 11 日

A Cost-based Storage Format Selector for Materialization in Big Data Frameworks

翻译：在大数据框架内进行材料化的成本基存储格式选择器

Rana Faisal Munir,Alberto Abelló,Oscar Romero,Maik Thiele,Wolfgang Lehner

Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously. Typically, users deploy Data-Intensive Workflows (DIWs) for their analytical tasks. These DIWs of different users share many common parts (i.e, 50-80%), which can be materialized to reuse them in future executions. The materialization improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems (DFS) by using a fixed data format. However, a fixed choice might not be the optimal one for every situation. For example, it is well-known that different data fragmentation strategies (i.e., horizontal, vertical or hybrid) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present a cost-based approach which helps deciding the most appropriate storage format in every situation. A generic cost-based storage format selector framework considering the three fragmentation strategies is presented. Then, we use our framework to instantiate cost models for specific Hadoop data formats (namely SequenceFile, Avro and Parquet), and test it with realistic use cases. Our solution gives on average 33% speedup over SequenceFile, 11% speedup over Avro, 32% speedup over Parquet, and overall, it provides upto 25% performance gain.

翻译：现代大数据框架( 如 Hadoop 和 Spark ) 允许多个用户同时进行大规模分析。通常, 用户部署数据密集工作流程( DIWs ) 以完成分析任务。这些不同用户的DIW 共享许多共同部分( 即, 50- 80 % ), 可以实现, 在未来执行时再使用它们。物质化提高了 DIW 的总体处理时间, 并节省了计算资源。目前, 使用固定数据格式, 在分布式文件系统( DFS) 上进行物质化存储数据的方法。但是, 固定的选择可能不是每种情况下的最佳选择。例如, 众所周知, 不同的数据碎裂战略( 即, 水平、垂直或混合) 与随后操作的存取模式不同。在本文中, 我们提出了一个基于成本的方法, 帮助决定每种情况下最合适的存储格式。一种基于成本的存储格式选择框架, 考虑三种碎裂战略。然后, 我们使用我们的框架将快速成本模型用于特定的 Hadopopopopopy Paly Pal- passquequequel 和 Avoquest viquel 。

0

相关内容

Storage

Storage

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

专知会员服务

36+阅读 · 2020年4月14日

经济学中的数据科学，Data Science in Economics，附22页pdf

经济学中的数据科学，Data Science in Economics，附22页pdf

专知会员服务

36+阅读 · 2020年4月1日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

开源书：PyTorch深度学习起步

开源书：PyTorch深度学习起步

专知会员服务

51+阅读 · 2019年10月11日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

人工智能 | SCI期刊专刊信息3条

人工智能 | SCI期刊专刊信息3条

Call4Papers

5+阅读 · 2019年1月10日

Ray RLlib: Scalable 降龙十八掌

Ray RLlib: Scalable 降龙十八掌

CreateAMind

9+阅读 · 2018年12月28日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

分布式TensorFlow入门指南

分布式TensorFlow入门指南

机器学习研究会

4+阅读 · 2017年11月28日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

Semantics of Data Mining Services in Cloud Computing

Semantics of Data Mining Services in Cloud Computing

Arxiv

4+阅读 · 2018年10月5日

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Arxiv

12+阅读 · 2018年6月8日

BigDL: A Distributed Deep Learning Framework for Big Data

Arxiv

4+阅读 · 2018年4月16日

Deep Communicating Agents for Abstractive Summarization

Arxiv

5+阅读 · 2018年3月27日

CuLDA_CGS: Solving Large-scale LDA Problems on GPUs

Arxiv

3+阅读 · 2018年3月13日

The Search Problem in Mixture Models

Arxiv

3+阅读 · 2018年2月24日

What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)

Arxiv

3+阅读 · 2018年2月20日

MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning

Arxiv

4+阅读 · 2018年1月11日

Multilingual Topic Models

Arxiv

3+阅读 · 2017年12月18日

A Big Data Analysis Framework Using Apache Spark and Deep Learning

Arxiv

3+阅读 · 2017年11月25日

VIP会员

文章信息

相关主题

相关VIP内容

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

专知会员服务

36+阅读 · 2020年4月14日

经济学中的数据科学，Data Science in Economics，附22页pdf

经济学中的数据科学，Data Science in Economics，附22页pdf

专知会员服务

36+阅读 · 2020年4月1日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

开源书：PyTorch深度学习起步

开源书：PyTorch深度学习起步

专知会员服务

51+阅读 · 2019年10月11日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

热门VIP内容

开通专知VIP会员享更多权益服务

最新，DeepSeek-R1论文登上Nature封面，附83页补充材料

人工智能与未来战争

自动驾驶中的轨迹预测大型基础模型：全面综述

万字长文《对抗雷达系统的电子战综述》

相关资讯

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

人工智能 | SCI期刊专刊信息3条

人工智能 | SCI期刊专刊信息3条

Call4Papers

5+阅读 · 2019年1月10日

Ray RLlib: Scalable 降龙十八掌

Ray RLlib: Scalable 降龙十八掌

CreateAMind

9+阅读 · 2018年12月28日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

分布式TensorFlow入门指南

分布式TensorFlow入门指南

机器学习研究会

4+阅读 · 2017年11月28日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

相关论文

Semantics of Data Mining Services in Cloud Computing

Semantics of Data Mining Services in Cloud Computing

Arxiv

4+阅读 · 2018年10月5日

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Arxiv

12+阅读 · 2018年6月8日

BigDL: A Distributed Deep Learning Framework for Big Data

Arxiv

4+阅读 · 2018年4月16日

Deep Communicating Agents for Abstractive Summarization

Arxiv

5+阅读 · 2018年3月27日

CuLDA_CGS: Solving Large-scale LDA Problems on GPUs

Arxiv

3+阅读 · 2018年3月13日

The Search Problem in Mixture Models

Arxiv

3+阅读 · 2018年2月24日

What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)

Arxiv

3+阅读 · 2018年2月20日

MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning

Arxiv

4+阅读 · 2018年1月11日

Multilingual Topic Models

Arxiv

3+阅读 · 2017年12月18日

A Big Data Analysis Framework Using Apache Spark and Deep Learning

Arxiv

3+阅读 · 2017年11月25日

微信扫码咨询专知VIP会员