The goal of lossy data compression is to reduce the storage cost of a data set $X$ while retaining as much information as possible about something ($Y$) that you care about. For example, what aspects of an image $X$ contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping $X\to Z\equiv f(X)$ that maximizes the mutual information $I(Z,Y)$ while keeping the entropy $H(Z)$ below some fixed threshold. We present a method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable $X$ (an image, say) drawn from a class $Y\in\{1,\ldots,n\}$ can be distilled into a vector $W=f(X)\in \mathbb{R}^{n-1}$ losslessly, so that $I(W,Y)=I(X,Y)$; for example, for a binary classification task of cats and dogs, each image $X$ is mapped into a single real number $W$ retaining all information that helps distinguish cats from dogs. For the $n=2$ case of binary classification, we then show how $W$ can be further compressed into a discrete variable $Z=g_\beta(W)\in\{1,\ldots,m_\beta\}$ by binning $W$ into $m_\beta$ bins, in such a way that varying the parameter $\beta$ sweeps out the full Pareto frontier, solving a generalization of the Discrete Information Bottleneck (DIB) problem. We argue that the most interesting points on this frontier are "corners" maximizing $I(Z,Y)$ for a fixed number of bins $m=2,3,\ldots$, which can conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm.
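The binning step described above can be illustrated with a minimal sketch. Assuming a one-dimensional score $W$ separating two classes (here a hypothetical toy task with Gaussian class-conditional scores, not the paper's distillation network), we compress $W$ into $m$ equal-width bins and estimate $I(Z,Y)$ with a plug-in estimator, evaluating the "corner" points at fixed bin counts $m=2,3,4$:

```python
import numpy as np

def mutual_information(z, y):
    """Plug-in estimate of I(Z;Y) in bits from paired discrete samples."""
    zs, z_idx = np.unique(z, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(zs), len(ys)))
    np.add.at(joint, (z_idx, y_idx), 1)     # joint histogram of (Z, Y)
    p = joint / joint.sum()
    pz = p.sum(axis=1, keepdims=True)       # marginal P(Z)
    py = p.sum(axis=0, keepdims=True)       # marginal P(Y)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pz @ py)[nz])).sum())

def bin_w(w, m):
    """Compress the real-valued W into Z in {0, ..., m-1} via equal-width bins."""
    edges = np.linspace(w.min(), w.max(), m + 1)
    return np.clip(np.digitize(w, edges[1:-1]), 0, m - 1)

# Toy binary task: W is a noisy 1-D score, shifted by the class label Y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5000)
w = y + rng.normal(0.0, 0.7, size=y.shape)

for m in (2, 3, 4):  # "corner" points: max I(Z,Y) at fixed bin count m
    z = bin_w(w, m)
    print(f"m = {m}: I(Z,Y) = {mutual_information(z, y):.3f} bits")
```

Since $Y$ is binary here, $I(Z,Y)$ is bounded by $H(Y)\le 1$ bit; finer binning trades a larger $H(Z)$ for more retained class information, tracing out points toward the Pareto frontier. The paper's method additionally optimizes the bin boundaries rather than using equal widths.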