SimpLex: a lexical text simplification architecture - 专知 (Zhuanzhi) papers


Text simplification · Simplex · Perplexity · Word embedding · Transformation

April 14, 2023

SimpLex: a lexical text simplification architecture


Ciprian-Octavian Truică, Andrei-Ionut Stan, Elena-Simona Apostol

Text simplification (TS) is the process of generating easy-to-understand sentences from a given sentence or piece of text. The aim of TS is to reduce both the lexical complexity (vocabulary complexity and meaning) and the syntactic complexity (sentence structure) of a given text or sentence without losing meaning or nuance. In this paper, we present SimpLex, a novel simplification architecture for generating simplified English sentences. To generate a simplified sentence, the proposed architecture uses either word embeddings (i.e., Word2Vec) and perplexity, or sentence transformers (i.e., BERT, RoBERTa, and GPT2) and cosine similarity. The solution is incorporated into a user-friendly, simple-to-use software tool. We evaluate our system using two metrics: SARI and Perplexity Decrease. Experimentally, we observe that the transformer models outperform the other models in terms of SARI score. In terms of perplexity, however, the word-embedding-based models achieve the largest decrease. The main contributions of this paper are: (1) we propose a new word-embedding-based and transformer-based algorithm for text simplification; (2) we design SimpLex, a modular novel text simplification system that can provide a baseline for further research; and (3) we perform an in-depth analysis of our solution and compare our results with two state-of-the-art models, i.e., LightLS [19] and NTS-w2v [44]. We also make the code publicly available online.
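The two selection signals named in the abstract, cosine similarity between sentence vectors and perplexity under a language model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, and the toy token probabilities are all hypothetical. Treating "Perplexity Decrease" as a plain difference between the original and simplified perplexities is also an assumption, since the abstract does not define it. In a real pipeline the vectors would come from a sentence transformer and the probabilities from a language model.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_by_similarity(original_vec: Sequence[float],
                       candidate_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the candidate sentence whose vector is closest to the original.

    Hypothetical helper: keeps the rewrite that best preserves meaning,
    using cosine similarity as the proxy.
    """
    return max(candidate_vecs,
               key=lambda s: cosine_similarity(original_vec, candidate_vecs[s]))


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity from per-token probabilities: exp of the mean negative log-probability."""
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


def perplexity_decrease(original_probs: Sequence[float],
                        simplified_probs: Sequence[float]) -> float:
    """Assumed definition: perplexity(original) minus perplexity(simplified)."""
    return perplexity(original_probs) - perplexity(simplified_probs)


if __name__ == "__main__":
    # Toy 2-d "sentence embeddings" (made up for illustration).
    original = [1.0, 0.0]
    candidates = {
        "The cat sat on the mat.": [1.0, 1.0],
        "Stocks fell on Monday.": [0.0, 1.0],
    }
    print(pick_by_similarity(original, candidates))

    # Toy per-token probabilities (made up for illustration).
    print(perplexity_decrease([0.25, 0.25], [0.5, 0.5]))
```

A positive value from `perplexity_decrease` means the simplified sentence is more predictable to the language model than the original, which is the direction a perplexity-based selector would favor.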

