Canonical data models (CDMs) have gained traction as a pattern for data integration in streaming pipelines that extract, transform and load data (ETL). CDMs are particularly useful for integrating microservice systems (Villaca et al 2020, Oliveira et al 2019). However, the transformation to a CDM is complex (Lemcke et al 2012). In this paper, we present a solution based on a new dynamic mapping matrix (DMM). The DMM has been implemented in an app called Message ETL (METL). METL is the key part of a new ETL streaming pipeline at EOS. EOS is part of the Otto-Group, the second-largest e-commerce provider in Europe. The pipeline is based on Kafka streams. METL transforms Kafka messages that contain a set of data objects described by one of n' different extracting schemata. It transforms each of these incoming messages into several outgoing ones. Each outgoing message contains a subset of the incoming data objects but describes them with a different schema, namely one of m' different CDM schemata. For the mapping, METL requires a matrix that consists of m' × n' sub-matrix mapping blocks. This matrix poses three problems: its sparsity, its adaptation to changes in the schemata, and time efficiency. We solve these problems by block-partitioning, sub-matrix formation and pattern generalization. In this process, we derive sets of permutation matrices. We show that they can be used for automated updates, for parallel computation in near real-time, and for compaction. The set of all permutation matrices forms the dynamic mapping matrix. For the solution, we draw on research into matrix partitioning (Quinn 2004) and dynamic networks (Haase et al 2021).
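The block structure of the mapping can be illustrated with a minimal sketch. It is not the METL implementation: the schema names, field names and the `transform` helper are hypothetical, and each non-empty block is simply stored as a dictionary holding the positions of its 1-entries, so that the empty blocks of the sparse m' × n' matrix take no space.

```python
from typing import Dict, List, Tuple

# Hypothetical schema and field names, for illustration only.
# Each non-empty block (cdm_schema, source_schema) is stored sparsely as
# {target_field: source_field}, i.e. the positions of the 1-entries in the block.
Block = Dict[str, str]
BLOCKS: Dict[Tuple[str, str], Block] = {
    ("cdm_customer", "crm_v1"): {"customer_id": "id", "full_name": "name"},
    ("cdm_address", "crm_v1"): {"customer_id": "id", "city": "town"},
    ("cdm_customer", "shop_v2"): {"customer_id": "cust_no"},
}


def transform(source_schema: str, payload: Dict[str, object]) -> List[Dict[str, object]]:
    """Map one incoming message to one outgoing message per matching CDM schema."""
    outgoing = []
    for (cdm_schema, src), block in BLOCKS.items():
        if src != source_schema:
            continue  # blocks for other source schemata do not apply
        # Copy each mapped source field to its target field, skipping absent ones.
        data = {target: payload[source]
                for target, source in block.items() if source in payload}
        outgoing.append({"schema": cdm_schema, "data": data})
    return outgoing


if __name__ == "__main__":
    for message in transform("crm_v1", {"id": 7, "name": "Ada", "town": "Hamburg"}):
        print(message)
```

Because every block is independent of the others, the loop body could in principle run in parallel per block, and a schema change would only require replacing the blocks in the affected row or column.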