InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works

Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature. Such automated approaches enable quantitative studies on large corpora that would not be feasible by manual inspection alone. However, due to copyright restrictions, the availability of relevant digitized literary works is limited. Derived Text Formats (DTFs) have been proposed as a solution. Here, textual materials are transformed in such a way that copyright-critical features are removed while certain analytical methods remain usable. Contextualized word embeddings produced by transformer encoders (like BERT) are promising candidates for DTFs because they allow for state-of-the-art performance on various analytical tasks and, at first sight, do not disclose the original text. However, in this paper we demonstrate that under certain conditions the reconstruction of the original copyrighted text becomes feasible, so publishing it in the form of contextualized word representations is not safe. Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical, since it allows an attacker to generate data for training a decoder whose reconstruction accuracy is sufficient to violate copyright laws.
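The attack idea in the abstract can be illustrated with a deliberately small sketch. It is not the paper's method: the "encoder" below is a toy stand-in for BERT (a random lookup table with a little neighbour mixing), and the "decoder" is a nearest-centroid lookup rather than a trained neural decoder. All names (`make_encoder`, `train_decoder`, `decode`, `DIM`, `MIX`) and the tiny vocabulary are hypothetical. The point is only the data flow: whoever can query the encoder on text of their own can build (embedding, token) pairs and use them to map published embeddings back to tokens.

```python
import random

DIM = 32   # toy embedding width (assumption)
MIX = 0.1  # how strongly each neighbour leaks into a token's vector (assumption)


def make_encoder(vocab, seed=0):
    """Return a toy 'contextual' encoder; a stand-in, not BERT."""
    rng = random.Random(seed)
    table = {w: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for w in vocab}

    def encode(tokens):
        vectors = []
        for i, tok in enumerate(tokens):
            vec = list(table[tok])
            # Add a fraction of each neighbour so the vector depends on context.
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens):
                    neighbour = table[tokens[j]]
                    vec = [v + MIX * n for v, n in zip(vec, neighbour)]
            vectors.append(vec)
        return vectors

    return encode


def train_decoder(encode, aux_sentences):
    """Attacker side: encode own text and average the vectors seen per token."""
    sums, counts = {}, {}
    for sent in aux_sentences:
        for tok, vec in zip(sent, encode(sent)):
            acc = sums.setdefault(tok, [0.0] * DIM)
            for k, v in enumerate(vec):
                acc[k] += v
            counts[tok] = counts.get(tok, 0) + 1
    return {tok: [s / counts[tok] for s in acc] for tok, acc in sums.items()}


def decode(centroids, embeddings):
    """Map each embedding to the token with the closest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(centroids, key=lambda t: sq_dist(centroids[t], e))
            for e in embeddings]


vocab = "the a cat dog sat ran on under mat tree".split()
encode = make_encoder(vocab)

# Auxiliary text the attacker is free to use (random token sequences here).
rng = random.Random(1)
aux = [[rng.choice(vocab) for _ in range(8)] for _ in range(200)]
decoder = train_decoder(encode, aux)

# The "protected" text is only published as embeddings.
protected = "the cat sat on the mat".split()
published = encode(protected)

reconstructed = decode(decoder, published)
accuracy = sum(a == b for a, b in zip(protected, reconstructed)) / len(protected)
print(reconstructed, accuracy)
```

In this toy setting the context mixing is weak, so reconstruction is easy; a real transformer encoder is far harder to invert, which is why the paper trains a dedicated decoder on encoder-generated data.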

