We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More precisely: Given sets $A,B\subseteq [u]$, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want to have $E[X] = t\cdot J(A,B)$, where $J(A,B) = |A\cap B|/|A\cup B|$ and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for $X$. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called $\textit{sketches}$. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left ( \min_{a\in A} h_1(A),\ldots, \min_{a\in A} h_t(A) \right )$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al.~[NIPS'12] introduced $\textit{one permutation hashing (OPH)}$, which creates a sketch of size $t$ in $O(t + |A|)$ time, but with the drawback that possibly some of the $t$ entries are ``empty'' when $|A| = O(t)$. One could argue that sketching is not necessary in this case, however the desire in most applications is to have $\textit{one}$ sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li~[ICML'14]. However, these ``densification'' schemes fail to provide good concentration bounds exactly in the case $|A| = O(t)$, where they are needed. (continued...)
翻译:暂无翻译