r-enum算法再探：面向上下文敏感重复序列与净频率的加速与扩展 (R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies)

Nishimoto and Tabei [CPM, 2021] proposed r-enum, an algorithm to enumerate various characteristic substrings, including maximal repeats, in a string $T$ of length $n$ in $O(r)$ words of compressed working space, where $r \le n$ is the number of runs in the Burrows-Wheeler transform (BWT) of $T$. Given the run-length encoded BWT (RLBWT) of $T$, r-enum runs in $O(n\log\log_{w}(n/r))$ time in addition to the time linear to the number of output strings, where $w=Θ(\log n)$ is the word size. In this paper, we improve the $O(n\log\log_{w}(n/r))$ term to $O(n)$. We also extend r-enum to compute other context-sensitive repeats such as near-supermaximal repeats (NSMRs) and supermaximal repeats, and the context diversity for every maximal repeat in the same complexities. Furthermore, we study the occurrences that witness NSMRs, which have recently attracted attention under the name of net occurrences: An occurrence of a repeat is called a net occurrence if it is not covered by another repeat, and the net frequency of a repeat is the number of its net occurrences. With this terminology, an NSMR is defined to be a repeat with a positive net frequency. Given the RLBWT of $T$, we show how to compute the set $S^{nsmr}$ of all NSMRs in $T$ together with their net frequency/occurrences in $O(n)$ time and $O(r)$ space. We also show that an $O(r)$-space data structure can be built from the RLBWT to support queries of computing the net frequency of any query pattern $P$ in $O(|P|)$ time. The data structure is built in $O(r)$ space and in $O(n)$ time with high probability or deterministic $O(n+|S^{nsmr}|\log\log\min(σ,|S^{nsmr}|))$ time, where $σ\le r$ is the alphabet size of $T$. To achieve this, we prove that the total number of net occurrences is less than $2r$. We also get a new upper bound $2r$ of the number of minimal unique substrings in $T$, which may be of independent interest.

翻译：Nishimoto与Tabei [CPM, 2021] 提出了r-enum算法，用于枚举字符串$T$（长度为$n$）中的各类特征子串，包括最大重复序列。该算法在$O(r)$字的压缩工作空间内运行，其中$r \le n$为$T$的Burrows-Wheeler变换（BWT）的游程数。给定$T$的游程编码BWT（RLBWT），r-enum在输出字符串数量线性时间的基础上，额外需要$O(n\log\log_{w}(n/r))$时间，其中$w=Θ(\log n)$为字长。本文中，我们将$O(n\log\log_{w}(n/r))$项改进至$O(n)$。同时，我们将r-enum扩展至计算其他上下文敏感重复序列，如近超最大重复序列（NSMRs）与超最大重复序列，并以相同复杂度计算每个最大重复序列的上下文多样性。此外，我们研究了见证NSMRs的出现位置（近期以净出现为名受到关注）：若一个重复序列的出现未被其他重复序列覆盖，则称为净出现；重复序列的净频率即其净出现次数。在此术语下，NSMR被定义为具有正净频率的重复序列。给定$T$的RLBWT，我们展示了如何在$O(n)$时间与$O(r)$空间内计算$T$中所有NSMR的集合$S^{nsmr}$及其净频率/出现。我们还证明，可从RLBWT构建$O(r)$空间的数据结构，支持在$O(|P|)$时间内计算任意查询模式$P$的净频率。该数据结构以$O(r)$空间构建，时间复杂度为高概率$O(n)$或确定性$O(n+|S^{nsmr}|\log\log\min(σ,|S^{nsmr}|))$，其中$σ\le r$为$T$的字母表大小。为实现此目标，我们证明了净出现总次数小于$2r$。同时，我们得到了$T$中最小唯一子串数量的新上界$2r$，这可能具有独立研究价值。