AtteSTNet -- An attention and subword tokenization based approach for code-switched text hate speech detection


Tokenization · Unigram · Hate speech detection · Code-mixing · Social media

March 28, 2023


Geet Shingi, Vedangi Wagh, Kishor Wagh, Sharmila Wagh

Recent advances in technology have boosted social media usage, which in turn has produced large amounts of user-generated data, including hateful and offensive speech. The language used on social media is often a combination of English and the native language of the region. In India, Hindi is used predominantly and is often code-switched with English, giving rise to Hinglish (Hindi+English). Various approaches have been proposed to classify code-mixed Hinglish hate speech using machine learning and deep learning techniques. However, these techniques rely on recurrence or convolution mechanisms, which are computationally expensive and have high memory requirements. Past techniques also depend on complex data processing, which makes them complicated and hard to adapt to changes in the data. The proposed work offers a much simpler approach that not only matches these complex networks but exceeds their performance by using subword tokenization algorithms such as BPE and Unigram together with multi-head attention, giving an accuracy of 87.41% and an F1 score of 0.851 on standard datasets. Efficient use of the BPE and Unigram algorithms helps handle the unconventional Hinglish vocabulary, making the proposed technique simple, efficient, and sustainable to use in the real world.
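As a rough illustration of the subword tokenization mentioned in the abstract, the sketch below learns byte-pair encoding (BPE) merges from a handful of romanized Hindi spellings and uses them to segment a spelling it has not seen. It is a toy version of the general BPE algorithm, not the authors' implementation: the function names `learn_bpe`, `merge_pair`, and `segment`, the `</w>` end-of-word marker, and the sample words are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into subwords by applying the merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Made-up spelling variants of the kind found in romanized Hindi text.
    corpus = ["nahi", "nahin", "nahi", "nhi", "kyun", "kyu", "kyun"]
    rules = learn_bpe(corpus, num_merges=8)
    print(rules)
    print(segment("nahiii", rules))  # an unseen spelling falls back to smaller pieces
```

Because an unseen spelling is broken into known subwords or single characters instead of being mapped to an unknown token, a classifier placed on top still receives usable input for non-standard vocabulary.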


