Sentiment analysis is a field of study that focuses on identifying the ideas expressed in text and classifying them as positive, negative, or neutral. Feature selection is a crucial step in machine learning. In this paper, we study the performance of different feature selection techniques for sentiment analysis. Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique to build the feature vocabulary. We experiment with various Feature Selection (FS) techniques to select the best set of features from this vocabulary. The selected features are used to train several machine learning classifiers: Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree (DT), and Naive Bayes (NB). The ensemble techniques Bagging and Random Subspace are applied to these classifiers to improve their performance on sentiment analysis. We show that the best FS techniques, when trained using ensemble methods, achieve remarkable results on sentiment analysis. We also compare the performance of FS methods trained using Bagging and Random Subspace against a variety of neural network architectures. We show that FS techniques trained using ensemble classifiers outperform neural networks while requiring significantly less training time and fewer parameters, thereby eliminating the need for extensive hyper-parameter tuning.
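As a rough illustration of the first two stages described above (TF-IDF feature extraction followed by feature selection), the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the toy corpus, the function names, the particular TF-IDF variant (`tf * (ln(N/df) + 1)`), and the choice of the chi-square statistic as the FS criterion are all assumptions made for illustration, since the abstract does not name the specific FS techniques it evaluates. The classifier and ensemble stages are not shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (text, label) with 1 = positive, 0 = negative.
DOCS = [
    ("great movie loved it", 1),
    ("loved the great acting", 1),
    ("wonderful plot great fun", 1),
    ("terrible movie hated it", 0),
    ("hated the boring plot", 0),
    ("boring and terrible acting", 0),
]


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in doc i."""
    tokenized = [text.split() for text, _ in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumed IDF variant: ln(N / df) + 1, so every present term has weight > 0.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, rows


def chi2_scores(rows, labels):
    """Chi-square score of each feature from its 2x2 presence/class table."""
    n = len(rows)
    scores = []
    for j in range(len(rows[0])):
        pairs = [(r[j] > 0, y) for r, y in zip(rows, labels)]
        a = sum(1 for present, y in pairs if present and y == 1)
        b = sum(1 for present, y in pairs if present and y == 0)
        c = sum(1 for present, y in pairs if not present and y == 1)
        d = sum(1 for present, y in pairs if not present and y == 0)
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores.append(n * (a * d - b * c) ** 2 / denom if denom else 0.0)
    return scores


def select_top_k(vocab, scores, k):
    """Return the k terms with the highest scores (ties keep vocabulary order)."""
    ranked = sorted(range(len(vocab)), key=lambda j: scores[j], reverse=True)
    return [vocab[j] for j in ranked[:k]]


if __name__ == "__main__":
    vocab, rows = tfidf(DOCS)
    labels = [label for _, label in DOCS]
    scores = chi2_scores(rows, labels)
    print(select_top_k(vocab, scores, 5))
```

On this toy corpus, terms that occur equally often in both classes (such as "movie" or "plot") receive a chi-square score of zero and are discarded, while class-specific terms are kept. The reduced feature matrix would then be passed to the classifiers and ensemble methods named in the abstract.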