State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
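The core identity behind this construction is that a product of kernels is itself a kernel whose feature map is the Kronecker product of the parent feature maps. The following minimal NumPy sketch illustrates this for linear kernels; the variable names are illustrative and not taken from the KrossFuse codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps of two embedding models, evaluated on inputs x and y.
phi1_x, phi1_y = rng.standard_normal(4), rng.standard_normal(4)
phi2_x, phi2_y = rng.standard_normal(3), rng.standard_normal(3)

# Product of the two per-model kernel similarities.
k_product = np.dot(phi1_x, phi1_y) * np.dot(phi2_x, phi2_y)

# The same value is the kernel of the Kronecker-product features,
# by the mixed-product property: (a . b)(c . d) = (a (x) c) . (b (x) d).
k_kron = np.dot(np.kron(phi1_x, phi2_x), np.kron(phi1_y, phi2_y))

print(np.isclose(k_product, k_kron))  # True
```

Note that the Kronecker feature space has dimension equal to the product of the two embedding dimensions, which is what motivates the random-projection approximation in RP-KrossFuse.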