Generalized linear models (GLMs) are widely used in practice to model non-Gaussian response variables. When the number of explanatory features is relatively large, researchers are often interested in performing controlled feature selection to simplify the downstream analysis. This paper introduces a new framework for feature selection in GLMs that achieves false discovery rate (FDR) control in two asymptotic regimes. The key step is to construct a mirror statistic measuring the importance of each feature, based upon two (asymptotically) independent estimates of the corresponding true coefficient obtained via either the data-splitting method or the Gaussian mirror method. FDR control is achieved by exploiting the property that, for any null feature, the sampling distribution of its mirror statistic is (asymptotically) symmetric about 0. In the moderate-dimensional setting, in which the ratio between the dimension p (the number of features) and the sample size n converges to a fixed value, we construct the mirror statistic based on the maximum likelihood estimate. In the high-dimensional setting, where p is much larger than n, we use the debiased Lasso to build the mirror statistic. Compared to the Benjamini-Hochberg procedure, which crucially relies on the asymptotic normality of the Z statistic, the proposed methodology is scale-free, as it hinges only on the symmetry property, and is thus expected to be more robust in finite-sample settings. Both simulation results and a real-data application show that the proposed methods control the FDR and are often more powerful than existing methods, including the Benjamini-Hochberg procedure and the knockoff filter.
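The data-splitting idea above can be illustrated with a minimal sketch. The following toy example uses a linear model with OLS estimates rather than a GLM with maximum likelihood, and the simulated data, the target FDR level `q`, and the specific cutoff rule are all illustrative assumptions; the mirror statistic `M = sign(b1*b2)*(|b1|+|b2|)` is one common choice, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 1000, 20, 0.05          # sample size, dimension, target FDR level
beta = np.zeros(p)
beta[:5] = 1.0                     # features 0-4 are the true signals
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Split the data and compute two independent estimates of beta.
half = n // 2
b1, *_ = np.linalg.lstsq(X[:half], y[:half], rcond=None)
b2, *_ = np.linalg.lstsq(X[half:], y[half:], rcond=None)

# Mirror statistic: symmetric about 0 for null features,
# large and positive for true signals.
M = np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))

# Data-driven cutoff: the smallest t whose estimated FDR is below q,
# using the left tail of M to estimate the number of false discoveries.
ts = np.sort(np.abs(M))
tau = next((t for t in ts
            if (M <= -t).sum() / max((M >= t).sum(), 1) <= q), np.inf)
selected = np.where(M >= tau)[0]
```

With a strong signal, the selected set recovers the five true features while the symmetric null statistics fall below the cutoff.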
Phishing is a type of social engineering attack that aims to steal user data, including login credentials and credit card numbers, leading to financial losses for both organisations and individuals. It occurs when an attacker, posing as a trusted entity, lures a victim into clicking on a link or attachment in an email or text message. Phishing is often launched via email or text messages over social networks. Previous research has shown that phishing attacks can be identified just by looking at URLs (Uniform Resource Locators). Identifying the techniques phishers use to craft deceptive URLs, however, remains challenging. At present, we have limited knowledge and understanding of how cybercriminals mimic URLs with the same look and feel as legitimate ones to entice people into clicking links. This paper therefore investigates feature selection for phishing URLs, aiming to uncover the strategies phishers employ to craft URLs that trick people into clicking links. We applied Information Gain (IG) and Chi-Squared feature selection methods from Machine Learning (ML) to a phishing dataset. The dataset contains 48 features extracted from 5000 phishing and 5000 legitimate URLs, collected from web pages downloaded from January to May 2015 and from May to June 2017. Our results reveal 10 techniques that phishers use to mimic URLs and manipulate humans into clicking links. Identifying these URL manipulation techniques can help educate individuals and organisations and keep them safe from phishing attacks. The findings of this research will also help in developing anti-phishing tools, frameworks, or browser plug-ins for phishing prevention.
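The two scoring functions named above are standard and easy to sketch for binary features. The toy data below (a feature identical to the label and an independent one) is an illustrative assumption, not the paper's URL dataset; IG here is computed as mutual information in bits.

```python
import numpy as np

def chi_squared(x, y):
    """Chi-squared statistic of the 2x2 contingency table of binary x, y."""
    obs = np.array([[np.sum((x == i) & (y == j)) for j in (0, 1)]
                    for i in (0, 1)], dtype=float)
    expected = (obs.sum(axis=1, keepdims=True)
                * obs.sum(axis=0, keepdims=True) / obs.sum())
    return np.sum((obs - expected) ** 2 / expected)

def information_gain(x, y):
    """Information gain I(x; y) in bits for binary x, y."""
    ig = 0.0
    for i in (0, 1):
        for j in (0, 1):
            p_xy = np.mean((x == i) & (y == j))
            p_x, p_y = np.mean(x == i), np.mean(y == j)
            if p_xy > 0:
                ig += p_xy * np.log2(p_xy / (p_x * p_y))
    return ig

y = np.repeat([0, 1], 50)        # 50 legitimate and 50 phishing labels
x_good = y.copy()                 # feature perfectly aligned with the label
x_noise = np.tile([0, 1], 50)     # feature independent of the label
print(chi_squared(x_good, y), information_gain(x_good, y))    # 100.0 1.0
print(chi_squared(x_noise, y), information_gain(x_noise, y))  # 0.0 0.0
```

Ranking features by either score surfaces the informative ones; both methods agree here, though they can order borderline features differently on real data.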
We introduce MANTRA, an annotated dataset of 4869 transient and 71207 non-transient object light curves built from the Catalina Real-Time Transient Survey. We provide public access to this dataset as a plain-text file to facilitate standardized quantitative comparison of astronomical transient-event recognition algorithms. Some of the classes included in the dataset are supernovae, cataclysmic variables, active galactic nuclei, high-proper-motion stars, blazars, and flares. As an example of the tasks that can be performed on the dataset, we experiment with multiple data pre-processing methods, feature selection techniques, and popular machine learning algorithms (Support Vector Machines, Random Forests, and Neural Networks). We assess quantitative performance in two classification tasks: binary (transient/non-transient) and eight-class classification. The best-performing algorithm in both tasks is the Random Forest classifier, which achieves an F1-score of 96.25% in the binary task and 52.79% in the eight-class task. In the eight-class task, non-transients (96.83%) is the class with the highest F1-score, while the lowest corresponds to high-proper-motion stars (16.79%); for supernovae the classifier achieves 54.57%, close to the average across classes. The next release of MANTRA will include images and benchmarks with deep learning models.
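The per-class F1-scores reported above come from the standard definition, which is worth sketching since the macro average over eight classes behaves quite differently from the binary score. The labels below are made up for illustration; this is not the MANTRA evaluation code.

```python
import numpy as np

def f1_per_class(y_true, y_pred, n_classes):
    """Per-class F1 = harmonic mean of precision and recall."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return scores

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])
per_class = f1_per_class(y_true, y_pred, 2)
macro = sum(per_class) / len(per_class)   # unweighted average across classes
```

Because the macro average weights every class equally, a rare, hard class (such as high-proper-motion stars above) can pull it far below the score of the dominant class.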
Producing results that are realistic according to human visual perception is the central concern in image transformation tasks. Perceptual learning approaches such as the perceptual loss are empirically powerful for such tasks, but they usually rely on a pre-trained classification network to provide features, which are not necessarily optimal with respect to the visual perception of image transformations. In this paper, we argue that, among the feature representations from the pre-trained classification network, only a limited number of dimensions are related to human visual perception, while the others are irrelevant, although both affect the final image transformation results. Under this assumption, we attempt to disentangle the perception-relevant dimensions from the representation through our proposed online contrastive learning. The resulting network consists of the pre-trained part and a feature selection layer, followed by the contrastive learning module, which uses the transformed results, target images, and task-oriented distorted images as the positive, negative, and anchor samples, respectively. Contrastive learning aims to activate the perception-relevant dimensions and suppress the irrelevant ones via the triplet loss, so that the original representation can be disentangled for better perceptual quality. Experiments on various image transformation tasks demonstrate the superiority of our framework, in terms of human visual perception, over existing approaches that use pre-trained networks and empirically designed losses.
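The triplet loss mentioned above has a simple closed form on feature vectors. The sketch below uses squared Euclidean distances and a unit margin as illustrative assumptions; the role assignment in the comments (distorted image as anchor, transformed output as positive, target as negative) follows the abstract, but the actual feature extraction and margin are not specified there.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: push the anchor closer to the positive
    than to the negative by at least `margin` (squared distances)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])   # features of the task-oriented distorted image
pos = np.array([0.5, 0.0])      # features of the transformed result
neg = np.array([2.0, 0.0])      # features of the target image
loss = triplet_loss(anchor, pos, neg)   # 0.0: the margin is already satisfied
```

When the negative sits closer to the anchor, the loss becomes positive and its gradient reshapes the feature dimensions, which is the mechanism the framework uses to activate or suppress them.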
Optimal transport (OT) is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which jointly solves the feature selection and OT problems. Specifically, we formulate the FROT problem as a min-max optimization problem. We then propose a convex formulation of FROT and solve it with a Frank-Wolfe-based optimization algorithm, in which the sub-problem can be solved efficiently using the Sinkhorn algorithm. A key advantage of FROT is that important features can be determined analytically by simply solving the convex optimization problem. Furthermore, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. Through synthetic and benchmark experiments, we demonstrate that the proposed method can identify important features. Additionally, we show that the FROT algorithm achieves state-of-the-art performance on real-world semantic correspondence datasets.
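The Sinkhorn algorithm used for the sub-problem is a short fixed-point iteration, sketched below on entropic-regularized OT between two discrete distributions. The cost matrix, marginals, regularization strength `eps`, and iteration count are illustrative assumptions; FROT's full min-max structure around this solver is not shown.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """Entropic OT: alternate scaling updates until the transport plan's
    marginals match a (rows) and b (columns)."""
    K = np.exp(-C / eps)           # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)          # fix column marginals
        u = a / (K @ v)            # fix row marginals
    return u[:, None] * K * v[None, :]   # transport plan P = diag(u) K diag(v)

a = np.full(3, 1 / 3)              # uniform source distribution
b = np.full(3, 1 / 3)              # uniform target distribution
C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
P = sinkhorn(a, b, C)              # near-identity plan: matching is cheapest
```

Each iteration costs only two matrix-vector products, which is why Sinkhorn makes the Frank-Wolfe sub-problem cheap even when called many times.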
A main obstacle to the wide use of deep learning in the medical and engineering sciences is its lack of interpretability. While neural network models are powerful tools for making predictions, they often provide little information about which features play significant roles in determining the predictions. To overcome this issue, many regularization procedures for learning with neural networks have been proposed to drop non-significant features. Unfortunately, the lack of theoretical results casts doubt on the applicability of such pipelines. In this work, we propose and establish a theoretical guarantee for the use of the adaptive group lasso for selecting important features of neural networks. Specifically, we show that our feature selection method is consistent for single-output feed-forward neural networks with one hidden layer and hyperbolic tangent activation function. We demonstrate its applicability using both simulations and real data analysis.
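The group structure behind the adaptive group lasso in this setting can be sketched concretely: all first-layer weights leaving input feature j form one group, and block soft-thresholding (the proximal step of the group-lasso penalty) zeroes out whole groups at once, removing the feature. The weight matrix, penalty level, and adaptive weights below are illustrative assumptions, not the paper's estimator or its tuning.

```python
import numpy as np

def group_soft_threshold(W, lam, w_adapt):
    """Proximal step of the adaptive group lasso: shrink each column of the
    first-layer weight matrix W (one column per input feature) as a group."""
    W_new = W.copy()
    for j in range(W.shape[1]):
        norm = np.linalg.norm(W[:, j])
        scale = max(0.0, 1.0 - lam * w_adapt[j] / norm) if norm > 0 else 0.0
        W_new[:, j] = scale * W[:, j]
    return W_new

W = np.array([[1.0, 0.05],
              [1.0, 0.05]])        # input feature 0 is strong, feature 1 weak
# Adaptive weights: larger penalty on groups with a small initial estimate.
w_adapt = 1.0 / np.linalg.norm(W, axis=0)
W_shrunk = group_soft_threshold(W, lam=0.1, w_adapt=w_adapt)
```

The weak feature's group is driven exactly to zero while the strong feature's weights are only mildly shrunk; this all-or-nothing behavior at the group level is what makes the procedure a feature selector rather than a generic regularizer.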
Interpretability is important in many applications of machine learning to signal data, covering aspects such as how well a model fits the data, how accurately explanations can be drawn from it, and how well these can be understood by people. Feature extraction and selection can improve model interpretability by identifying structures in the data that are both informative and intuitively meaningful. To this end, we propose a signal classification framework that combines feature extraction with feature selection using the knockoff filter, a method that provides guarantees on the false discovery rate (FDR) among selected features. We apply this framework to a dataset of Raman spectroscopy measurements from bacterial samples. Using a wavelet-based feature representation of the data and a logistic regression classifier, our framework achieves significantly higher predictive accuracy than using the original features as input. We also benchmarked against features obtained through principal component analysis, as well as the original features fed into a neural-network-based classifier. Our framework achieved better predictive performance than the PCA-based approach and comparable performance to the neural network, while offering the advantage of a more compact and human-interpretable set of features.
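The FDR guarantee of the knockoff filter comes from its data-dependent threshold on per-feature importance statistics W_j (positive when the real feature beats its knockoff, symmetric about 0 for nulls). The sketch below shows only that selection rule, using the knockoff+ variant of the threshold; the W values are illustrative assumptions, since computing them requires constructing knockoff copies of the wavelet features, which is not shown.

```python
import numpy as np

def knockoff_threshold(W, q=0.2):
    """Knockoff+ threshold: smallest t at which the estimated fraction of
    false discoveries among {W_j >= t} is at most q."""
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf   # no threshold achieves the target FDR: select nothing

# Hypothetical importance statistics: five clear signals, five nulls near 0.
W = np.array([3.1, 2.4, 2.2, 1.9, 1.8, -0.3, 0.2, -0.1, 0.05, -0.2])
tau = knockoff_threshold(W)
selected = np.where(W >= tau)[0]   # indices 0-4, the five large statistics
```

The `1 +` in the numerator is what distinguishes knockoff+ and yields exact (rather than modified) FDR control; dropping it makes the rule slightly more liberal.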