Our goal is to develop a general strategy for decomposing a random variable $X$ into multiple independent random variables without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$ such that $X = \sum_{k=1}^K X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing the summation requirement and asking only that some known function of the independent random variables exactly reconstruct $X$. This generalization serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem very different, as applications of the same principle: sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
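As a concrete illustration of the additive case, $X = \sum_{k=1}^K X^{(k)}$, the sketch below thins a Poisson count. It relies on a standard fact: if $X \sim \mathrm{Poisson}(\lambda)$ and, conditional on $X$, the folds are drawn as $\mathrm{Multinomial}(X; 1/K, \ldots, 1/K)$, then the folds are independent $\mathrm{Poisson}(\lambda/K)$ variables that sum to $X$. The function name `poisson_thin` and the values of `lam`, `K`, and the sample size are assumptions chosen for illustration; this is a minimal sketch, not the paper's implementation, and it covers only the Poisson family with equal fold weights.

```python
import numpy as np


def poisson_thin(x, K, rng):
    """Split one Poisson count x into K folds that sum exactly to x.

    Hypothetical helper for illustration: each of the x units is assigned
    to one of K folds with equal probability 1/K.
    """
    return rng.multinomial(x, np.full(K, 1.0 / K))


rng = np.random.default_rng(0)
lam, K = 12.0, 3  # illustrative values, not taken from the paper

# Draw many realizations of X and thin each one into K folds.
x = rng.poisson(lam, size=20000)
folds = np.array([poisson_thin(int(xi), K, rng) for xi in x])

# The known function that reconstructs X is the sum of the folds.
reconstructed = folds.sum(axis=1)

# Empirical checks: fold means near lam / K, and near-zero correlation
# between two distinct folds.
fold_means = folds.mean(axis=0)
corr_01 = np.corrcoef(folds[:, 0], folds[:, 1])[0, 1]
```

Here `reconstructed` matches `x` exactly for every draw, while `fold_means` and `corr_01` are only empirical checks of the marginal mean and of the lack of correlation; near-zero correlation is consistent with independence but does not by itself establish it.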