We explore the limitations of and best practices for using black-box variational inference to estimate posterior summaries of the model parameters. By taking an importance sampling perspective, we are able to explain and empirically demonstrate: 1) why the intuitions about the behavior of approximate families and divergences for low-dimensional posteriors fail for higher-dimensional posteriors, 2) how we can diagnose the pre-asymptotic reliability of variational inference in practice by examining the behavior of the density ratios (i.e., importance weights), 3) why the choice of variational objective is not as relevant for higher-dimensional posteriors, and 4) why, although flexible variational families can provide some benefits in higher dimensions, they also introduce additional optimization challenges. Based on these findings, for high-dimensional posteriors we recommend using the exclusive KL divergence, which is the most stable and easiest to optimize, and then focusing on improving the variational family or using model parameter transformations to make the posterior more similar to the approximating family. Our results also show that in low to moderate dimensions, heavy-tailed variational families and mass-covering divergences can increase the chances that the approximation can be improved by importance sampling.
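As a minimal illustration of the diagnostic idea above, one can examine the density ratios between the target and the approximation via their effective sample size; the target, the mismatched scale 0.8, and the sample sizes below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: target p is a standard normal in d dimensions and the
# "variational" approximation q is a normal with a slightly narrower scale.
def log_p(x):
    return -0.5 * np.sum(x**2, axis=1)

def log_q(x, scale=0.8):
    d = x.shape[1]
    return -0.5 * np.sum((x / scale)**2, axis=1) - d * np.log(scale)

def relative_ess(draws):
    """Effective sample size fraction of self-normalized importance weights."""
    lw = log_p(draws) - log_q(draws)
    lw -= lw.max()                      # stabilize before exponentiating
    w = np.exp(lw)
    return (w.sum()**2 / (w**2).sum()) / len(w)

# The same mild mismatch per coordinate degrades sharply with dimension.
n = 5000
for d in (1, 10, 100):
    print(d, relative_ess(0.8 * rng.standard_normal((n, d))))
```

Even though q differs from p only modestly in each coordinate, the importance weights become highly variable as the dimension grows, which is exactly the pre-asymptotic failure the density-ratio diagnostic is meant to catch.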
In modern contexts, some types of data are observed at high resolution, essentially continuously in time. Such data units are best described as taking values in a space of functions. The subject units carrying the observations may have intrinsic relations among themselves, and are then best described by the nodes of a large graph. It is often sensible to think that the underlying signals in these functional observations vary smoothly over the graph, in that neighboring nodes have similar underlying signals. This qualitative information allows borrowing of strength across neighboring nodes and consequently leads to more accurate inference. In this paper, we consider a model with Gaussian functional observations and adopt a Bayesian approach to smoothing over the nodes of the graph. We characterize the minimax rate of estimation in terms of the regularity of the signals and their variation across nodes, quantified through the graph Laplacian. We show that an appropriate prior constructed from the graph Laplacian can attain the minimax bound, while a mixture prior attains the minimax rate, up to a logarithmic factor, simultaneously for all possible values of functional and graphical smoothness. We also show that in the fixed-smoothness setting, an optimally sized credible region has arbitrarily high frequentist coverage. A simulation experiment demonstrates that the method performs better than potential competitors such as random forests. The method is also applied to a dataset of daily temperatures measured at several weather stations in the US state of North Carolina.
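A minimal sketch of the general idea of a Laplacian-based prior (not the paper's exact construction): take a Gaussian prior over node-level signals whose covariance is a negative power of the regularized graph Laplacian, so that draws vary smoothly across neighboring nodes. The path graph, the exponent alpha, and the ridge tau below are assumptions for illustration.

```python
import numpy as np

def path_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph with n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def laplacian_prior_cov(L, alpha=1.0, tau=1e-2):
    """Prior covariance (L + tau*I)^(-alpha), via the eigendecomposition of L."""
    vals, vecs = np.linalg.eigh(L)
    return vecs @ np.diag((vals + tau) ** (-alpha)) @ vecs.T

n = 6
K = laplacian_prior_cov(path_laplacian(n))

rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(n), K)   # one prior draw over the nodes
```

Because low-frequency eigenvectors of the Laplacian receive large prior variance, adjacent nodes are more strongly correlated under K than distant ones, which is the "borrowing of strength over neighboring nodes" described above.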
Consider the setting where there are B>1 candidate statistical models and one is interested in model selection. Two common approaches are to select a single model or to combine the candidate models through model averaging. Instead, we select a subset of the combined parameter space associated with the models. Specifically, a model averaging perspective is used to expand the parameter space, and a model selection criterion is used to select a subset of this expanded parameter space. We account for the variability of the criterion by adapting the method of Yekutieli (2012) to Bayesian model averaging (BMA). Yekutieli (2012) treats model selection as a truncation problem: we truncate the joint support of the data and the parameter space to include only small values of the covariance penalized error (CPE) criterion. The CPE is a general expression that contains several information criteria as special cases. Simulation results show that, as long as the truncated set does not have near-zero probability, we tend to obtain lower mean squared error than BMA. Theoretical results that underpin these observations are also provided. We apply our approach to a dataset of American Community Survey (ACS) period estimates to illustrate that this perspective can lead to improvements over using a single model.
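As a generic reminder of the model-averaging ingredient (not the paper's CPE-based truncation), BMA weights are often approximated from an information criterion; the BIC values below are made-up numbers for illustration.

```python
import math

# Hypothetical BIC values for B = 3 candidate models (assumed numbers).
bics = [102.3, 100.1, 107.8]

# Approximate posterior model probabilities: w_k proportional to exp(-BIC_k / 2).
shift = min(bics)                                 # subtract the min for stability
raw = [math.exp(-(b - shift) / 2) for b in bics]
weights = [r / sum(raw) for r in raw]
```

The model with the smallest criterion value receives the largest weight; the truncation perspective described above instead restricts attention to the region of the expanded parameter space where such a criterion is small.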
Given two relations containing multiple measurements, possibly with uncertainties, our objective is to find which sets of attributes from the first have a corresponding set in the second, using exclusively a sample of the data. This approach can be used even when the associated metadata is damaged, missing or incomplete, or when the data volume is too big for exact methods. The problem is similar to the search for inclusion dependencies (INDs), a type of rule over two relations asserting that, for a set of attributes X from the first, every combination of values appears in a set Y from the second. Existing INDs can be found by exploiting a partial order relation called specialization. However, this relation is based on set theory, requiring the values to be directly comparable. Statistical tests are an intuitive replacement, but how they would affect the underlying assumptions has not been studied. In this paper we formally review the effect that a statistical approach has on the inference rules applied to IND discovery. Our results confirm the intuition that statistical tests can be used, but not in a directly equivalent manner. We provide a workable alternative based on a "hierarchy of null hypotheses", allowing for the automatic discovery of multi-dimensional equally distributed sets of attributes.
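A minimal sketch of what "replacing set containment with a statistical test" can look like for a single numeric attribute pair: compare the two samples with a two-sample Kolmogorov-Smirnov statistic instead of checking value containment. The data and the implicit decision rule below are illustrative assumptions, not the paper's calibrated procedure.

```python
import random

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    grid = sorted(set(xs) | set(ys))
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in grid)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(300)]       # sample of attribute X
y_same = [rng.gauss(0, 1) for _ in range(300)]  # Y from the same distribution
y_diff = [rng.gauss(3, 1) for _ in range(300)]  # Y from a shifted distribution

# A small statistic is compatible with "equally distributed"; a large one is not.
print(ks_statistic(x, y_same), ks_statistic(x, y_diff))
```

Unlike set containment, such a test is only evidence against (or compatible with) equality of distributions, which is why the inference rules for IND discovery cannot be carried over unchanged.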
Models with a large number of latent variables are often used to fully utilize the information in big or complex data. However, they can be difficult to estimate using standard approaches, and variational inference methods are a popular alternative. Key to the success of these methods is the selection of an approximation to the target density that is accurate, tractable and fast to calibrate using optimization methods. Most existing choices can be inaccurate or slow to calibrate when there are many latent variables. Here, we propose a family of tractable variational approximations that are more accurate and faster to calibrate for this case. Our approach combines a parsimonious parametric approximation for the parameter posterior with the exact conditional posterior of the latent variables. We derive a simplified expression for the re-parameterization gradient of the variational lower bound, which is the main ingredient of efficient optimization algorithms used to implement variational estimation. Doing so requires only the ability to generate exactly or approximately from the conditional posterior of the latent variables, rather than to compute its density. We illustrate using two complex contemporary econometric examples. The first is a nonlinear multivariate state space model for U.S. macroeconomic variables. The second is a random coefficients tobit model applied to two million sales by 20,000 individuals in a large consumer panel from a marketing study. In both cases, we show that our approximating family is considerably more accurate than mean field or structured Gaussian approximations, and faster than Markov chain Monte Carlo. Last, we show how to implement data sub-sampling in variational inference for our approximation, which can lead to a further reduction in computation time. MATLAB code implementing the method for our examples is included in supplementary material.
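For readers unfamiliar with the re-parameterization gradient mentioned above, here is a minimal sketch for a one-dimensional Gaussian approximation to an unnormalized target; the target, learning rate, and iteration counts are illustrative assumptions, and this is not the paper's simplified estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    return -0.5 * (theta - 2.0) ** 2    # unnormalized N(2, 1) target (assumed)

# Variational family q(theta) = N(mu, sigma^2); ELBO = E[log p] + log sigma + const.
mu, log_sigma = 0.0, 0.0
lr, n_draws = 0.05, 64

for _ in range(500):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_draws)
    theta = mu + sigma * eps            # reparameterized draws: theta = mu + sigma*eps
    g = -(theta - 2.0)                  # d log_target / d theta at the draws
    grad_mu = g.mean()                  # chain rule: d theta / d mu = 1
    grad_log_sigma = (g * sigma * eps).mean() + 1.0  # d theta/d log sigma = sigma*eps,
    mu += lr * grad_mu                               # plus entropy gradient (= 1)
    log_sigma += lr * grad_log_sigma    # stochastic gradient ascent on the ELBO
```

Because the draws are written as a deterministic transform of parameter-free noise, the gradient passes through the sampling step, which is what makes this estimator low-variance and easy to automate.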
False negative errors are of major concern in applications where missing a high proportion of true signals may have serious consequences. False negative control, however, poses a major challenge in high-dimensional inference when signals are not identifiable at the individual level. We develop a new analytic framework to regulate false negative errors under measures tailored to modern applications with high-dimensional data. A new method is proposed for realistic settings with arbitrary covariance dependence between variables. We explicate the joint effects of covariance dependence and signal sparsity on the new method and interpret the results using a phase diagram. It shows that signals that are not individually identifiable can be effectively retained by the proposed method without incurring excessive false positives. Simulation studies compare the new method with several existing methods; the new method outperforms the others in adapting to a user-specified false negative control level. We apply the new method to an fMRI dataset to locate voxels that are functionally relevant to saccadic eye movements. The new method exhibits a good balance between retaining signal voxels and avoiding excessive noise voxels.
$\alpha$-posteriors and their variational approximations distort standard posterior inference by downweighting the likelihood and introducing variational approximation errors. We show that such distortions, if tuned appropriately, reduce the Kullback-Leibler (KL) divergence from the true, but perhaps infeasible, posterior distribution when there is potential parametric model misspecification. To make this point, we derive a Bernstein-von Mises theorem showing convergence in total variation distance of $\alpha$-posteriors and their variational approximations to limiting Gaussian distributions. We use these distributions to evaluate the KL divergence between true and reported posteriors. We show this divergence is minimized by choosing $\alpha$ strictly smaller than one, assuming there is a vanishingly small probability of model misspecification. The optimized value becomes smaller as the misspecification becomes more severe. The optimized KL divergence increases logarithmically in the degree of misspecification, rather than linearly as with the usual posterior.
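For reference, the $\alpha$-posterior is the standard posterior with the likelihood tempered by the power $\alpha$:

```latex
\pi_\alpha(\theta \mid x_{1:n})
  \;\propto\; \left[\prod_{i=1}^{n} p(x_i \mid \theta)\right]^{\alpha} \pi(\theta),
  \qquad \alpha > 0,
```

so $\alpha = 1$ recovers the usual posterior, and $\alpha < 1$ downweights the likelihood relative to the prior, which is the distortion analyzed above.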
We consider the problem of jointly modeling and clustering populations of tensors by introducing a flexible high-dimensional tensor mixture model with heterogeneous covariances. The proposed mixture model exploits the intrinsic structures of tensor data, and is assumed to have means that are low-rank and internally sparse as well as heterogeneous covariances that are separable and conditionally sparse. We develop an efficient high-dimensional expectation-conditional-maximization (HECM) algorithm that breaks the challenging optimization in the M-step into several simpler conditional optimization problems, each of which is convex, admits regularization and has closed-form updating formulas. We show that the proposed HECM algorithm, with an appropriate initialization, converges geometrically to a neighborhood that is within statistical precision of the true parameter. Such a theoretical analysis is highly nontrivial due to the dual non-convexity arising from both the EM-type estimation and the non-convex objective function in the M-step. The efficacy of our proposed method is demonstrated through simulation studies and an application to an autism spectrum disorder study, where our analysis identifies important brain regions for diagnosis.
Category recommendation for users on an e-commerce platform is an important task, as it dictates the flow of traffic through the website. It is therefore important to surface precise and diverse category recommendations to aid the users' journey through the platform and to help them discover new groups of items. An often understated aspect of category recommendation is users' proclivity for repeat purchases. The structure of this temporal behavior can be harnessed for better category recommendations, and in this work we attempt to do so through variational inference. Further, to enhance the variational inference based optimization, we initialize the optimizer at better starting points through the well-known Metapath2Vec algorithm. We demonstrate our results on two real-world datasets and show that our model outperforms standard baseline methods.
Amortized inference has led to efficient approximate inference for large datasets. The quality of posterior inference is largely determined by two factors: a) the ability of the variational distribution to model the true posterior and b) the capacity of the recognition network to generalize inference over all datapoints. We analyze approximate inference in variational autoencoders in terms of these factors. We find that suboptimal inference is often due to amortizing inference rather than the limited complexity of the approximating distribution. We show that this is due partly to the generator learning to accommodate the choice of approximation. Furthermore, we show that the parameters used to increase the expressiveness of the approximation play a role in generalizing inference rather than simply improving the complexity of the approximation.
Robust estimation is much more challenging in high dimensions than in one dimension: most techniques either lead to intractable optimization problems or to estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial-time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and make high-dimensional robust estimation a realistic possibility.
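A minimal sketch of one filtering idea underlying algorithms of this kind: repeatedly trim points with extreme projections onto the top eigenvector of the empirical covariance, since corruptions that shift the mean also inflate the variance in some direction. The contamination model, trimming fraction, and stopping threshold below are illustrative assumptions, not the refined algorithms of the work above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_good, n_bad = 20, 500, 50
inliers = rng.standard_normal((n_good, d))          # clean N(0, I) samples
outliers = rng.standard_normal((n_bad, d)) + 5.0    # corrupted, mean shifted by 5
X = np.vstack([inliers, outliers])

def filtered_mean(X, rounds=10, trim=0.05):
    """Iteratively trim points that are extreme along the top principal direction."""
    X = X.copy()
    for _ in range(rounds):
        mu = X.mean(axis=0)
        cov = np.cov(X.T)
        vals, vecs = np.linalg.eigh(cov)
        if vals[-1] < 1.5:              # heuristic stop: covariance near identity
            break
        proj = (X - mu) @ vecs[:, -1]   # scores along the top eigenvector
        keep = np.abs(proj) <= np.quantile(np.abs(proj), 1 - trim)
        X = X[keep]
    return X.mean(axis=0)

naive = X.mean(axis=0)                  # pulled toward the corruptions
robust = filtered_mean(X)               # much closer to the true mean (zero)
```

With a 5 standard-deviation shift in every coordinate, the corrupted points dominate the top eigenvector, so the filter removes them quickly while leaving the inliers' mean essentially unbiased.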