Phylogenetic trees are simple models of evolutionary processes. They describe conditionally independent divergent evolution from common ancestors. However, they often lack the flexibility to represent processes such as introgressive hybridization, which leads to gene flow between taxa. Phylogenetic networks generalize trees but typically assume that ancestral taxa merge instantaneously to form ``hybrid'' descendants. In contrast, convergence-divergence models retain a single underlying ``principal tree'' and permit gene flow over arbitrary time frames. They can also model other biological processes that cause taxa to become more similar, such as replicated evolution. We present novel maximum likelihood algorithms that use a quartet-based approach to infer most aspects of $N$-taxon convergence-divergence models, many of them consistently. All three algorithms build on $4$-taxon convergence-divergence models, each inferred from a four-taxon subset of the $N$ taxa using a model selection criterion. The first algorithm infers an $N$-taxon principal tree; the second infers sets of converging taxa; and the third infers model parameters: root probabilities, edge lengths, and convergence parameters. The algorithms can be applied to multiple sequence alignments restricted to genes or genomic windows, or to gene presence/absence datasets. We demonstrate that convergence-divergence models can be accurately recovered from simulated data.
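As a rough illustration of the quartet-based setup, the sketch below enumerates every 4-taxon subset of a set of $N$ taxa and picks one candidate model per subset by a lower-is-better score. It is a minimal sketch under stated assumptions, not the algorithms presented here: the function names `quartets` and `select_model`, the model labels, and the numeric scores are hypothetical, and the abstract does not specify which model selection criterion is used.

```python
from itertools import combinations
from math import comb


def quartets(taxa):
    """Return every 4-taxon subset of `taxa` as a sorted tuple."""
    return list(combinations(sorted(taxa), 4))


def select_model(scores):
    """Return the candidate label with the lowest score.

    `scores` maps a hypothetical model label to a criterion value where
    lower is better (an assumption; the criterion itself is not named
    in the abstract).
    """
    return min(scores, key=scores.get)


taxa = ["A", "B", "C", "D", "E", "F"]
subsets = quartets(taxa)

# The number of 4-taxon subsets is the binomial coefficient C(N, 4).
assert len(subsets) == comb(len(taxa), 4)

# Made-up scores for one quartet, purely to show the selection step.
example_scores = {"model_1": 1012.4, "model_2": 1003.9, "model_3": 1008.1}
best = select_model(example_scores)
```

The number of subsets grows as $\binom{N}{4}$, so a practical implementation may need to sample quartets for large $N$; whether the algorithms do so is not stated in the abstract.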