We present a new algorithmic framework for grouped variable selection that is based on discrete mathematical optimization. While there exist several appealing approaches based on convex relaxations and nonconvex heuristics, we focus on optimal solutions for the $\ell_0$-regularized formulation, a problem that is relatively unexplored due to computational challenges. Our methodology covers both high-dimensional linear regression and nonparametric sparse additive modeling with smooth components. Our algorithmic framework consists of approximate and exact algorithms. The approximate algorithms are based on coordinate descent and local search, with runtimes comparable to popular sparse learning algorithms. Our exact algorithm is based on a standalone branch-and-bound (BnB) framework, which can solve the associated mixed integer programming (MIP) problem to certified optimality. By exploiting the problem structure, our custom BnB algorithm can solve to optimality problem instances with $5 \times 10^6$ features and $10^3$ observations in minutes to hours -- over $1000$ times larger than what is currently possible using state-of-the-art commercial MIP solvers. We also explore statistical properties of the $\ell_0$-based estimators. We demonstrate, theoretically and empirically, that our proposed estimators have an edge over popular group-sparse estimators in terms of statistical performance in various regimes. We provide an open-source implementation of our proposed framework.
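
To fix ideas, here is a minimal sketch of the coordinate descent building block (a simplified illustration, not the paper's implementation) for the group-$\ell_0$ objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \sum_g \mathbf{1}[\beta_g \neq 0]$, assuming each group's columns have been orthonormalized:

```python
import numpy as np

def group_l0_cd(X, y, groups, lam, n_iters=100):
    """Block coordinate descent for (1/2n)||y - X b||^2 + lam * #(nonzero groups).

    Simplifying assumption: each group's columns satisfy X_g.T @ X_g / n = I,
    so the block update is closed-form and the keep/drop test is exact.
    groups: list of integer index arrays, e.g. [np.array([0, 1]), np.array([2, 3, 4])].
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                      # residual y - X @ beta (beta starts at 0)
    for _ in range(n_iters):
        for g in groups:
            Xg = X[:, g]
            r = r + Xg @ beta[g]      # add group g's contribution back
            bg = Xg.T @ r / n         # closed-form block least squares
            # keeping the group lowers the quadratic loss by ||bg||^2 / 2;
            # accept only if that beats the price lam of a nonzero group
            if 0.5 * (bg @ bg) > lam:
                beta[g] = bg
                r = r - Xg @ bg
            else:
                beta[g] = 0.0
    return beta
```

The local search component then refines the fixed points that plain coordinate descent can get stuck in.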

Related content

Hierarchical forecasting problems arise when time series have a natural group structure, and predictions at multiple levels of aggregation and disaggregation across the groups are needed. In such problems, it is often desired to satisfy the aggregation constraints in a given hierarchy, referred to as hierarchical coherence in the literature. Maintaining hierarchical coherence while producing accurate forecasts can be a challenging problem, especially in the case of probabilistic forecasting. We present a novel method capable of accurate and coherent probabilistic forecasts for hierarchical time series. We call it Deep Poisson Mixture Network (DPMN). It relies on the combination of neural networks and a statistical model for the joint distribution of the hierarchical multivariate time series structure. By construction, the model guarantees hierarchical coherence and provides simple rules for aggregation and disaggregation of the predictive distributions. We perform an extensive empirical evaluation comparing the DPMN to other state-of-the-art methods which produce hierarchically coherent probabilistic forecasts on multiple public datasets. Compared to existing coherent probabilistic models, we obtained a relative improvement in the overall Continuous Ranked Probability Score (CRPS) of 17.1% on Australian domestic tourism data, 24.2% on the Favorita grocery sales dataset, and 6.9% on a San Francisco Bay Area highway traffic dataset.
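
A toy sketch of the coherence-by-construction idea (an illustration of the principle, not the DPMN architecture; the mixture rates below are invented): model only the bottom-level series probabilistically and push every joint sample through the hierarchy's summation matrix $S$, so aggregation constraints hold sample-by-sample and therefore for the entire predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchy: total = A + B, with bottom-level series A and B.
S = np.array([[1, 1],    # total
              [1, 0],    # A
              [0, 1]])   # B

def sample_bottom(n_samples):
    """Stand-in for a learned bottom-level predictive distribution:
    a two-component Poisson mixture per bottom series (made-up rates)."""
    comp = rng.integers(0, 2, size=(n_samples, 2))
    rates = np.where(comp == 0, [3.0, 7.0], [10.0, 1.0])
    return rng.poisson(rates)

samples = sample_bottom(10_000)          # (n_samples, 2) bottom-level draws
hier_samples = samples @ S.T             # (n_samples, 3): total, A, B
# Coherence holds for every sample, hence for any derived quantile:
assert np.all(hier_samples[:, 0] == hier_samples[:, 1] + hier_samples[:, 2])
```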

Models with intractable normalising functions have numerous applications ranging from network models to image analysis to spatial point processes. Because the normalising constants are functions of the parameters of interest, standard Markov chain Monte Carlo cannot be used for Bayesian inference for these models. A number of algorithms have been developed for such models. Some have the posterior distribution as the asymptotic distribution. Other "asymptotically inexact" algorithms do not possess this property. There is limited guidance for evaluating approximations based on these algorithms, and hence it is very hard to tune them. We propose two new diagnostics that address these problems for intractable normalising function models. Our first diagnostic, inspired by the second Bartlett identity, is, in principle, applicable in almost any likelihood-based context where misspecification is of concern. We develop an approximate version that is applicable to intractable normalising function problems. Our second diagnostic is a Monte Carlo approximation to a kernel Stein discrepancy-based diagnostic introduced by Gorham and Mackey (2017). We provide theoretical justification for our methods and apply them to several algorithms in the context of challenging simulated and real data examples including an Ising model, an exponential random graph model, and a Markov point process.
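
For orientation, the second Bartlett identity underlying the first diagnostic is the familiar information-matrix equality: for a correctly specified model with log-likelihood $\ell(\theta) = \log f_\theta(x)$,

$$\mathbb{E}_\theta\!\left[\nabla_\theta \ell(\theta)\, \nabla_\theta \ell(\theta)^\top\right] + \mathbb{E}_\theta\!\left[\nabla_\theta^2\, \ell(\theta)\right] = 0,$$

so a sample analogue of the left-hand side that is far from zero flags either misspecification or, in this setting, a poor approximation to the intractable likelihood.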

Bayesian quadrature (BQ) is a method for solving numerical integration problems in a Bayesian manner, which allows users to quantify their uncertainty about the solution. The standard approach to BQ is based on a Gaussian process (GP) approximation of the integrand. As a result, BQ is inherently limited to cases where GP approximations can be done in an efficient manner, thus often precluding very high-dimensional or non-smooth target functions. This paper proposes to tackle this issue with a new Bayesian numerical integration algorithm based on Bayesian Additive Regression Trees (BART) priors, which we call BART-Int. BART priors are easy to tune and well-suited for discontinuous functions. We demonstrate that they also lend themselves naturally to a sequential design setting and that explicit convergence rates can be obtained in a variety of settings. The advantages and disadvantages of this new methodology are highlighted on a set of benchmark tests including the Genz functions, and on a Bayesian survey design problem.
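
Schematically, BQ with a sampling-based prior reduces to pushing posterior draws of the integrand through a Monte Carlo average. The sketch below assumes a hypothetical `posterior_draws` interface standing in for a fitted BART model; it is not the BART-Int implementation:

```python
import numpy as np

def bq_integral(posterior_draws, sample_pi, n_mc=5_000, seed=0):
    """Posterior over Pi[f] (the integral of f with respect to Pi), given
    draws of the regression function.

    posterior_draws(X) -> array (n_draws, len(X)): evaluations of posterior
    function draws (e.g. from a BART fit); this interface is an assumption.
    sample_pi(n, rng)  -> array (n, d): i.i.d. samples from the measure Pi.
    """
    rng = np.random.default_rng(seed)
    X = sample_pi(n_mc, rng)                  # Monte Carlo nodes from Pi
    F = posterior_draws(X)                    # (n_draws, n_mc) evaluations
    integrals = F.mean(axis=1)                # one integral estimate per draw
    return integrals.mean(), integrals.std()  # posterior mean and sd of Pi[f]
```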

Fast algorithms for multiplying vectors by matrices whose elements are values of special functions (Chebyshev, Legendre, Laguerre, etc.) are widely used in computing integral and discrete transforms. The currently available fast algorithms are several orders of magnitude less efficient than the fast Fourier transform. To achieve higher efficiency, a convenient general approach for calculating matrix-vector products for this class of problems is proposed. A series of fast simple-structure algorithms developed under this approach can be implemented efficiently on modern microprocessors. The method has a pre-computation complexity of $O(N^2 \log N)$ and an execution complexity of $O(N \log N)$. Computational experiments show that these algorithms can decrease the calculation time by several orders of magnitude compared with a conventional direct method of matrix-vector multiplication.
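
For a sense of the target complexity class (this is the classical Chebyshev special case, not the algorithm proposed here): expansion coefficients at Chebyshev nodes follow from a type-II discrete cosine transform in $O(N \log N)$.

```python
import numpy as np
from scipy.fft import dct

def chebyshev_coeffs(f, N):
    """Coefficients a_k of f(x) ~ sum_k a_k T_k(x) from N Chebyshev nodes,
    computed in O(N log N) via a type-II DCT."""
    j = np.arange(N)
    x = np.cos(np.pi * (j + 0.5) / N)     # Chebyshev nodes in (-1, 1)
    a = dct(f(x), type=2) / N             # DCT-II with 1/N normalisation
    a[0] /= 2                             # standard halving of the k = 0 term
    return a

# Quick check against a known expansion: f = T_3, so a_3 should be 1.
a = chebyshev_coeffs(lambda x: 4 * x**3 - 3 * x, 16)
assert abs(a[3] - 1.0) < 1e-12
```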

Evaluation of treatment effects and more general estimands is typically achieved via parametric modelling, which is unsatisfactory since model misspecification is likely. Data-adaptive model building (e.g. statistical/machine learning) is commonly employed to reduce the risk of misspecification. Naive use of such methods, however, delivers estimators whose bias may shrink too slowly with sample size for inferential methods to perform well, including those based on the bootstrap. Bias arises because standard data-adaptive methods are tuned towards minimal prediction error as opposed to e.g. minimal MSE in the estimator. This may cause excess variability that is difficult to acknowledge, due to the complexity of such strategies. Building on results from non-parametric statistics, targeted learning and debiased machine learning overcome these problems by constructing estimators using the estimand's efficient influence function under the non-parametric model. These increasingly popular methodologies typically assume that the efficient influence function is given, or that the reader is familiar with its derivation. In this paper, we focus on derivation of the efficient influence function and explain how it may be used to construct statistical/machine-learning-based estimators. We discuss the requisite conditions for these estimators to perform well and use diverse examples to convey the broad applicability of the theory.
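
A canonical worked example of the theory (included here for concreteness) is the average treatment effect $\psi = \mathbb{E}\{\mu_1(X) - \mu_0(X)\}$, where $\mu_a(X) = \mathbb{E}(Y \mid A = a, X)$. Under the non-parametric model its efficient influence function is

$$\varphi(Y, A, X) = \frac{A}{\pi(X)}\{Y - \mu_1(X)\} - \frac{1 - A}{1 - \pi(X)}\{Y - \mu_0(X)\} + \mu_1(X) - \mu_0(X) - \psi,$$

with $\pi(X) = \Pr(A = 1 \mid X)$, and the resulting one-step (AIPW) estimator plugs data-adaptive fits of $\pi$ and $\mu_a$ into the estimating equation $n^{-1}\sum_i \hat\varphi_i = 0$.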

This paper deals with the grouped variable selection problem. A widely used strategy is to equip the loss function with a sparsity-promoting penalty. Existing methods include the group Lasso, group SCAD, and group MCP. The group Lasso solves a convex optimization problem but is plagued by underestimation bias. The group SCAD and group MCP avoid the estimation bias but require solving a non-convex optimization problem that suffers from local optima. In this work, we propose an alternative method based on the generalized minimax concave (GMC) penalty, which is a folded concave penalty that can maintain the convexity of the objective function. We develop a new method for grouped variable selection in linear regression, the group GMC, that generalizes the strategy of the original GMC estimator. We present an efficient algorithm for computing the group GMC estimator. We also prove properties of the solution path to guide its numerical computation and tuning parameter selection in practice. We establish error bounds for both the group GMC and original GMC estimators. A rich set of simulation studies and a real data application indicate that the proposed group GMC approach outperforms existing methods in several different aspects under a wide array of scenarios.
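
For reference, the original GMC penalty of Selesnick (2017) can be written (a standard form; the group version replaces the $\ell_1$ norms with sums of group norms):

$$\psi_B(\beta) = \|\beta\|_1 - \min_{v \in \mathbb{R}^p}\Big\{\|v\|_1 + \tfrac12\,\|B(\beta - v)\|_2^2\Big\},$$

and the least-squares objective $\tfrac12\|y - X\beta\|_2^2 + \lambda\,\psi_B(\beta)$ stays convex whenever $B^\top B \preceq \tfrac{1}{\lambda} X^\top X$, which is how a folded concave penalty can coexist with a convex optimization problem.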

We present a novel class of projected methods for performing statistical analysis on data sets of probability distributions on the real line, endowed with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure: we map the data to a suitable linear space and use a metric projection operator to constrain the results to the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods that exploit a constrained B-spline approximation. As a byproduct of our approach, we are also able to derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance for a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated and asymptotic consistency is proven. Two real-world applications to Covid-19 mortality in the US and wind speed forecasting are discussed.
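
A simplified sketch of the projected approach (grid evaluation and isotonic projection stand in for the authors' constrained B-spline machinery): on the real line the 2-Wasserstein geometry is flat in quantile coordinates, so one can run ordinary PCA on discretized quantile functions and then metrically project reconstructions back onto monotone functions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.isotonic import IsotonicRegression

def projected_wasserstein_pca(samples, n_components=2, grid_size=100):
    """Toy 'projected' PCA for distributions on the real line.

    samples: list of 1-d arrays, one per distribution. Steps: (1) represent
    each distribution by its quantile function on a grid (where W2 is just
    the L2 distance), (2) run ordinary PCA in that linear space, (3) push
    reconstructions back to valid quantile functions via the L2 projection
    onto monotone functions (isotonic regression).
    """
    q = np.linspace(0.01, 0.99, grid_size)
    Q = np.stack([np.quantile(s, q) for s in samples])   # quantile functions
    pca = PCA(n_components=n_components).fit(Q)
    recon = pca.inverse_transform(pca.transform(Q))
    iso = IsotonicRegression()                           # metric projection
    recon = np.stack([iso.fit_transform(q, r) for r in recon])
    return pca, recon
```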

We investigate a clustering problem with data from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We prove that its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-Cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-Cut problem, we conjecture that no polynomial-time method can perform better. Furthermore, we gather numerical and theoretical evidence that supports the existence of a statistical-computational gap. Finally, we generalize the Max-Cut program to a $k$-means program that handles multi-component mixtures with possibly unequal weights. It enjoys similar optimality guarantees for mixtures of distributions that satisfy a transportation-cost inequality, encompassing Gaussian and strongly log-concave distributions.
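
To make the spectral route concrete, here is a bare-bones sketch for the easy, well-conditioned case (a toy illustration only; the paper's algorithm additionally handles an unknown, ill-conditioned covariance):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance: symmetric two-component Gaussian mixture, identity covariance
# (so this sketch sidesteps the ill-conditioned setting the paper targets).
n, d = 2_000, 50
mu = np.zeros(d)
mu[0] = 3.0
labels = rng.choice([-1, 1], size=n)
X = labels[:, None] * mu + rng.normal(size=(n, d))

# Bare-bones spectral clustering: the mixture inflates variance along mu, so
# the top eigenvector of the sample second-moment matrix aligns with mu and
# the sign of the projection recovers labels up to a global sign flip.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
v = eigvecs[:, -1]                      # top eigenvector
pred = np.sign(X @ v)
err = min(np.mean(pred != labels), np.mean(pred == labels))
print(f"misclassification rate: {err:.3f}")
```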

Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the spread of influence from these seeds, and it has been widely investigated in the past two decades. In the canonical setting, the whole social network and its diffusion parameters are given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the set of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS), and present constant approximation algorithms for this task under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Compared with prior solutions, our network inference algorithm requires weaker assumptions and does not rely on maximum-likelihood estimation or convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, without requiring any assumption on the network structure or diffusion parameters.
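
For orientation, the optimization step that follows network inference is typically the classical greedy algorithm: estimate expected spread by Monte Carlo simulation of, e.g., the independent cascade model, and repeatedly add the seed with the largest marginal gain, which is $(1-1/e)$-optimal by submodularity. A minimal sketch of that canonical baseline (not the paper's IMS algorithms):

```python
import random

def simulate_ic(graph, seeds, trials=200):
    """Average spread of `seeds` under the independent cascade model.
    graph: dict node -> list of (neighbour, activation_probability)."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v, p in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials

def greedy_im(graph, k):
    """Greedy seed selection: repeatedly add the node with the largest
    estimated marginal gain in expected spread."""
    seeds = []
    for _ in range(k):
        best = max((u for u in graph if u not in seeds),
                   key=lambda u: simulate_ic(graph, seeds + [u]))
        seeds.append(best)
    return seeds
```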

Thanks to continuous advances in mathematical programming, Mixed Integer Optimization has become a competitive alternative to popular regularization methods for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. We tackle these challenges, reducing computational burden when tuning the sparsity bound (a parameter which is critical for effectiveness) and improving performance in the presence of feature collinearity and of signals that vary in nature and strength. Importantly, we render the approach efficient and effective in applications of realistic size and complexity - without resorting to relaxations or heuristics in the optimization, or abandoning rigorous cross-validation tuning. Computational viability and improved performance in subtler scenarios are achieved with a multi-pronged blueprint that leverages characteristics of the Mixed Integer Programming framework and employs whitening, a data pre-processing step.
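
The workhorse formulation here (standard in this line of work, stated for concreteness) is the big-$M$ mixed integer quadratic program for best subset selection with sparsity bound $k$:

$$\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \ \tfrac12\,\|y - X\beta\|_2^2 \quad \text{subject to} \quad -M z_j \le \beta_j \le M z_j \ \ (j = 1,\dots,p), \qquad \sum_{j=1}^p z_j \le k,$$

where the binary $z_j$ switches feature $j$ in or out; tuning the bound $k$ well is exactly the step whose computational cost the paper works to reduce.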
