Benchmark dataset of memes with text transcriptions for automatic detection of multi-modal misogynistic content

In this paper we present a benchmark dataset generated as part of a project for the automatic identification of misogyny in online content, with a particular focus on memes. The benchmark described here is composed of 800 memes collected from the most popular social media platforms, such as Facebook, Twitter, Instagram and Reddit, and from websites dedicated to the collection and creation of memes. To gather misogynistic memes, specific keywords that refer to misogynistic content were used as search criteria, covering different manifestations of hatred against women, such as body shaming, stereotyping, objectification and violence. In parallel, memes with no misogynistic content were manually downloaded from the same web sources. From all the collected memes, three domain experts selected a dataset of 800 memes equally balanced between misogynistic and non-misogynistic ones. This dataset was validated through a crowdsourcing platform, involving 60 subjects in the labelling process, in order to collect three evaluations for each instance. For memes evaluated as misogynistic, two further binary labels, concerning aggressiveness and irony, were collected from both the experts and the crowdsourcing platform. Finally, the text of each meme was manually transcribed. The dataset provided is thus composed of the 800 memes, the labels given by the experts and those obtained through the crowdsourcing validation, and the transcribed texts. These data can be used to approach the problem of automatic detection of misogynistic content on the Web relying on both textual and visual cues, addressing growing phenomena such as cybersexism and technology-facilitated violence.
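As a rough illustration of how the three crowdsourced evaluations per meme might be combined and compared with the expert label, the sketch below uses a simple majority vote. The abstract does not state how the evaluations are aggregated, so majority voting is an assumption here, and the record layout, field names, and sample values are hypothetical rather than the dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class MemeRecord:
    # Hypothetical record layout; field names are illustrative only.
    meme_id: str
    text: str                 # manually transcribed meme text
    expert_label: int         # 1 = misogynistic, 0 = not misogynistic
    crowd_labels: List[int]   # three crowdsourced evaluations


def majority_vote(labels: List[int]) -> int:
    """Return the most frequent label; three binary votes can never tie."""
    return Counter(labels).most_common(1)[0][0]


def expert_crowd_agreement(records: List[MemeRecord]) -> float:
    """Fraction of memes where the crowd majority matches the expert label."""
    matches = sum(majority_vote(r.crowd_labels) == r.expert_label for r in records)
    return matches / len(records)


# Made-up sample records, not taken from the dataset.
records = [
    MemeRecord("m001", "placeholder text a", 1, [1, 1, 0]),
    MemeRecord("m002", "placeholder text b", 0, [0, 0, 0]),
    MemeRecord("m003", "placeholder text c", 1, [0, 0, 1]),
    MemeRecord("m004", "placeholder text d", 0, [0, 1, 0]),
]
agreement = expert_crowd_agreement(records)
print(agreement)
```

The same aggregation could in principle be applied to the aggressiveness and irony labels, which the abstract says were collected only for memes evaluated as misogynistic.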


