This paper explains and evaluates the process of converting the Enron email dataset (the version cited in the preprint) into thousands of features per email for a selected set of 2400 labelled emails. The final features are tailored for Cosine distance, so that the Cosine distance between two emails inversely reflects, in an explainable and normalized fashion, the number of top indicative words that the two emails have in common. The labelling is based on the leaf folder name in the folder tree of the Enron email dataset, and the 2400 selected emails consist of 300 emails for each of the 8 labels. The evaluation is based on the accuracy of a k-nearest-neighbours (KNN) majority voting classification using Cosine distance. In addition to the KNN majority voting classification accuracy and the confusion matrix, some statistics for the process are reported. The KNN majority voting classification accuracy using Cosine distance is 76.75%, which indicates at least some level of success given the 8 labels involved: uniform random guessing over 8 equally sized labels would be expected to reach only 12.5%. The conversion yields 48557 features per selected email, of which exactly 40 features per email are non-zero. The resulting data set, named MeeefTCD (Massive Enhanced Extracted Email Features Tailored for Cosine Distance), is available at https://web.cs.dal.ca/~barahimi/data-sets/meeeftcd/ and in a GitHub repository mentioned in this paper.
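As an illustration of why such features suit Cosine distance, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, that every non-zero feature has weight 1 (the actual MeeefTCD weights may differ); under that assumption two emails with 40 non-zero features each that share c top words have a Cosine distance of 1 - c/40. The names `binary_vector`, `cosine_distance`, and `knn_predict` are hypothetical helpers introduced here.

```python
import math
from collections import Counter


def binary_vector(top_words):
    """Sparse vector with weight 1.0 per top word (an assumed weighting)."""
    return {word: 1.0 for word in top_words}


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(query, labelled, k):
    """Majority vote over the k labelled vectors closest to the query.

    `labelled` is a list of (vector, label) pairs.
    """
    nearest = sorted(labelled, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two emails with 40 top words each, 10 of which are shared (w30..w39).
email_a = binary_vector(f"w{i}" for i in range(40))
email_b = binary_vector(f"w{i}" for i in range(30, 70))
distance = cosine_distance(email_a, email_b)
```

Here `distance` comes out at roughly 0.75, that is 1 - 10/40, so a smaller distance corresponds directly to more shared top words.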