特征选择( Feature Selection )也称特征子集选择( Feature Subset Selection , FSS ),或属性选择( Attribute Selection )。是指从已有的M个特征(Feature)中选择N个特征使得系统的特定指标最优化,是从原始特征中选择出一些最有效特征以降低数据集维度的过程,是提高学习算法性能的一个重要手段,也是模式识别中关键的数据预处理步骤。对于一个学习算法来说,好的学习样本是训练模型的关键。

Competition-based FDR control has been commonly used for over a decade in the computational mass spectrometry community (Elias and Gygi, 2007). Recently, the approach has gained significant popularity in other fields after Barber and Candes (2015) laid its theoretical foundation in a more general setting that included the feature selection problem. In both cases, the competition is based on a head-to-head comparison between an observed score and a corresponding decoy / knockoff. Keich and Noble (2017b) recently demonstrated some advantages of using multiple rather than a single decoy when addressing the problem of assigning peptide sequences to observed mass spectra. In this work, we consider a related problem -- detecting peptides based on a collection of mass spectra -- and we develop a new framework for competition-based FDR control using multiple null scores. Within this framework, we offer several methods, all of which are based on a novel procedure that rigorously controls the FDR in the finite sample setting. Using real data to study the peptide detection problem we show that, relative to existing single-decoy methods, our approach can increase the number of discovered peptides by up to 50% at small FDR thresholds.

0+
0+
下载
预览

In this paper, we propose a new wrapper approach for semi-supervised feature selection. A common strategy in semi-supervised learning is to augment the training set by pseudo-labeled unlabeled examples. However, the pseudo-labeling procedure is prone to error and has a high risk of disrupting the learning algorithm with additional noisy labeled training data. To overcome this, we propose to model explicitly the mislabeling error during the learning phase with the overall aim of selecting the most relevant feature characteristics. We derive a $\mathcal{C}$-bound for Bayes classifiers trained over partially labeled training sets by taking into account the mislabeling errors. The risk bound is then considered as an objective function that is minimized over the space of possible feature subsets using a genetic algorithm. In order to produce both sparse and accurate solution, we propose a modification of a genetic algorithm with the crossover based on feature weights and recursive elimination of irrelevant features. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised feature selection approaches.

0+
0+
下载
预览

Markov blanket feature selection, while theoretically optimal, generally is challenging to implement. This is due to the shortcomings of existing approaches to conditional independence (CI) testing, which tend to struggle either with the curse of dimensionality or computational complexity. We propose a novel two-step approach which facilitates Markov blanket feature selection in high dimensions. First, neural networks are used to map features to low-dimensional representations. In the second step, CI testing is performed by applying the k-NN conditional mutual information estimator to the learned feature maps. The mappings are designed to ensure that mapped samples both preserve information and share similar information about the target variable if and only if they are close in Euclidean distance. We show that these properties boost the performance of the k-NN estimator in the second step. The performance of the proposed method is evaluated on synthetic, as well as real data pertaining to datacenter hard disk drive failures.

0+
0+
下载
预览

Prototype-based methods are of the particular interest for domain specialists and practitioners as they summarize a dataset by a small set of representatives. Therefore, in a classification setting, interpretability of the prototypes is as significant as the prediction accuracy of the algorithm. Nevertheless, the state-of-the-art methods make inefficient trade-offs between these concerns by sacrificing one in favor of the other, especially if the given data has a kernel-based representation. In this paper, we propose a novel interpretable multiple-kernel prototype learning (IMKPL) to construct highly interpretable prototypes in the feature space, which are also efficient for the discriminative representation of the data. Our method focuses on the local discrimination of the classes in the feature space and shaping the prototypes based on condensed class-homogeneous neighborhoods of data. Besides, IMKPL learns a combined embedding in the feature space in which the above objectives are better fulfilled. When the base kernels coincide with the data dimensions, this embedding results in a discriminative features selection. We evaluate IMKPL on several benchmarks from different domains which demonstrate its superiority to the related state-of-the-art methods regarding both interpretability and discriminative representation.

0+
0+
下载
预览

We study the problem of learning manipulation skills from human demonstration video by inferring association relationship between geometric features. Our motivation comes from the observation in human eye-hand coordination that a set of manipulation skills are actually minimizing the Euclidean distance between geometric primitives while regressing their association constraints in non-Euclidean space. We propose a graph based kernel regression method to directly infer the underlying association constraints from human demonstration video using Incremental Maximum Entropy Inverse Reinforcement Learning (InMaxEnt IRL). The learned skill inference provides human readable task definition and outputs control errors that can be directly plugged into traditional controllers. Our method removes the need of tedious feature selection and robust feature trackers in traditional approaches (e.g. feature based visual servoing). Experiments show our method reaches high accuracy even with only one human demonstration video and generalize well under variances.

0+
0+
下载
预览

Background: Functional magnetic resonance imaging (fMRI) provides non-invasive measures of neuronal activity using an endogenous Blood Oxygenation-Level Dependent (BOLD) contrast. This article introduces a nonlinear dimensionality reduction (Locally Linear Embedding) to extract informative measures of the underlying neuronal activity from BOLD time-series. The method is validated using the Leave-One-Out-Cross-Validation (LOOCV) accuracy of classifying psychiatric diagnoses using resting-state and task-related fMRI. Methods: Locally Linear Embedding of BOLD time-series (into each voxel's respective tensor) was used to optimise feature selection. This uses Gau\ss' Principle of Least Constraint to conserve quantities over both space and time. This conservation was assessed using LOOCV to greedily select time points in an incremental fashion on training data that was categorised in terms of psychiatric diagnoses. Findings: The embedded fMRI gave highly diagnostic performances (> 80%) on eleven publicly-available datasets containing healthy controls and patients with either Schizophrenia, Attention-Deficit Hyperactivity Disorder (ADHD), or Autism Spectrum Disorder (ASD). Furthermore, unlike the original fMRI data before or after using Principal Component Analysis (PCA) for artefact reduction, the embedded fMRI furnished significantly better than chance classification (defined as the majority class proportion) on ten of eleven datasets Interpretation: Locally Linear Embedding appears to be a useful feature extraction procedure that retains important information about patterns of brain activity distinguishing among psychiatric cohorts.

0+
0+
下载
预览

Biomedical data are widely accepted in developing prediction models for identifying a specific tumor, drug discovery and classification of human cancers. However, previous studies usually focused on different classifiers, and overlook the class imbalance problem in real-world biomedical datasets. There are a lack of studies on evaluation of data pre-processing techniques, such as resampling and feature selection, on imbalanced biomedical data learning. The relationship between data pre-processing techniques and the data distributions has never been analysed in previous studies. This article mainly focuses on reviewing and evaluating some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from data distribution perspective. Extensive experiments have been done based on five classifiers, four performance measures, eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques exhibit better performance using support vector machine (SVM) classifier. However, resampling and Feature Selection techniques perform poorly when using C4.5 decision tree and Linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques such as Random undersampling and Feature Selection perform better than other data pre-processing methods with T Location-Scale distribution when using SVM and KNN (K-nearest neighbours) classifiers. Random oversampling outperforms other methods on Negative Binomial distribution using Random Forest classifier with lower level of imbalance ratio; (3) Feature Selection outperforms other data pre-processing methods in most cases, thus, Feature Selection with SVM classifier is the best choice for imbalanced biomedical data learning.

0+
0+
下载
预览

Industrial Control Networks (ICN) such as Supervisory Control and Data Acquisition (SCADA) systems are widely used in industries for monitoring and controlling physical processes. These industries include power generation and supply, gas and oil production and delivery, water and waste management, telecommunication and transport facilities. The integration of internet exposes these systems to cyber threats. The consequences of compromised ICN are determine for a country economic and functional sustainability. Therefore, enforcing security and ensuring correctness operation became one of the biggest concerns for Industrial Control Systems (ICS), and need to be addressed. In this paper, we propose an anomaly detection approach for ICN using the physical properties of the system. We have developed operational baseline of electricity generation process and reduced the feature set using greedy and genetic feature selection algorithms. The classification is done based on Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and C4.5 decision tree with the help from the inter-arrival curves. The results show that the proposed approach successfully detects anomalies with a high degree of accuracy. In addition, they proved that SVM and C4.5 produces accurate results even for high sensitivity attacks when they used with the inter-arrival curves. As compared to this, k-NN is unable to produce good results for low and medium sensitivity attacks test cases.

0+
0+
下载
预览

Sentiment analysis is a domain of study that focuses on identifying and classifying the ideas expressed in the form of text into positive, negative and neutral polarities. Feature selection is a crucial process in machine learning. In this paper, we aim to study the performance of different feature selection techniques for sentiment analysis. Term Frequency Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for creating feature vocabulary. Various Feature Selection (FS) techniques are experimented to select the best set of features from feature vocabulary. The selected features are trained using different machine learning classifiers Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree (DT) and Naive Bayes (NB). Ensemble techniques Bagging and Random Subspace are applied on classifiers to enhance the performance on sentiment analysis. We show that, when the best FS techniques are trained using ensemble methods achieve remarkable results on sentiment analysis. We also compare the performance of FS methods trained using Bagging, Random Subspace with varied neural network architectures. We show that FS techniques trained using ensemble classifiers outperform neural networks requiring significantly less training time and parameters thereby eliminating the need for extensive hyper-parameter tuning.

1+
0+
下载
预览

Metaheuristic algorithms (MAs) have seen unprecedented growth thanks to their successful applications in fields including engineering and health sciences. In this work, we investigate the use of a deep learning (DL) model as an alternative tool to do so. The proposed method, called MaNet, is motivated by the fact that most of the DL models often need to solve massive nasty optimization problems consisting of millions of parameters. Feature selection is the main adopted concepts in MaNet that helps the algorithm to skip irrelevant or partially relevant evolutionary information and uses those which contribute most to the overall performance. The introduced model is applied on several unimodal and multimodal continuous problems. The experiments indicate that MaNet is able to yield competitive results compared to one of the best hand-designed algorithms for the aforementioned problems, in terms of the solution accuracy and scalability.

1+
0+
下载
预览

Motivation: Long non-coding RNAs (lncRNAs) are a diverse class of RNA molecules with a length above 200 nucleotides that do not encode proteins. Since lncRNAs have involved in a wide range of functions in cellular and developmental processes, an increasing number of methods or tools for distin-guishing lncRNAs from coding RNAs have been proposed. However, most of the existing methods are designed for lncRNAs in animal systems, and only a few methods focus on the plant lncRNA identifica-tion. Different from lncRNAs in animal systems, plant lncRNAs have distinct characteristics. It is desira-ble to develop a computational method for accurate and rapid identification of plant lncRNAs. Results: Herein, we present a plant lncRNA prediction approach PtLnc-BXE, which combines multiple sequence features in two steps to develop an ensemble mode. First, a diverse number of plants lncRNA features are collected and filtered by feature selection and subsequently used to represent RNA se-quences. Then, the training dataset is sampled into several subsets using the bootstrapping technique, and base learners are constructed on data subsets by using XGBoost, and multiple base learners are further combined into a single meta-learner by using logistic regression. PtLnc-BXE outperformed other state-of-the-art plant lncRNA prediction methods, achieving higher AUC (> 95.9%) on the benchmark datasets. The studies across different plant species reveal that the different species have a high overlap between their selected features for modeling. Therefore, it is possible to build the cross-species predic-tion models for plant lncRNAs. Availability: The scripts and data can be downloaded at https://github.com/xxxxx Contact: example@example.org Supplementary information: Supplementary data are available at Bioinformatics online.

0+
0+
下载
预览

We propose the Sobolev Independence Criterion (SIC), an interpretable dependency measure between a high dimensional random variable X and a response variable Y . SIC decomposes to the sum of feature importance scores and hence can be used for nonlinear feature selection. SIC can be seen as a gradient regularized Integral Probability Metric (IPM) between the joint distribution of the two random variables and the product of their marginals. We use sparsity inducing gradient penalties to promote input sparsity of the critic of the IPM. In the kernel version we show that SIC can be cast as a convex optimization problem by introducing auxiliary variables that play an important role in feature selection as they are normalized feature importance scores. We then present a neural version of SIC where the critic is parameterized as a homogeneous neural network, improving its representation power as well as its interpretability. We conduct experiments validating SIC for feature selection in synthetic and real-world experiments. We show that SIC enables reliable and interpretable discoveries, when used in conjunction with the holdout randomization test and knockoffs to control the False Discovery Rate. Code is available at http://github.com/ibm/sic.

0+
0+
下载
预览
Top