We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. By definition, the target sequences and their attribute values lie outside the training distribution, which poses a challenge to existing methods that aim to generate the target sequence directly. In this work, we instead propose Iterative Controlled Extrapolation (ICE), which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate a small improvement in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation.
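As a rough illustration of the iterative local-edit idea, the following minimal sketch repeatedly applies the best single-position edit to a toy sequence. It is not the released implementation: the names `iterative_edit`, `propose_edits`, and `score` are hypothetical, the greedy selection rule is an assumption (the abstract does not say how edits are chosen), and the toy scorer merely stands in for a learned attribute predictor.

```python
from typing import Callable, List

ALPHABET = "AB"  # toy alphabet, assumed for illustration only


def propose_edits(seq: str) -> List[str]:
    """Return every sequence that differs from `seq` at exactly one position."""
    edits = []
    for i, ch in enumerate(seq):
        for sub in ALPHABET:
            if sub != ch:
                edits.append(seq[:i] + sub + seq[i + 1:])
    return edits


def score(seq: str) -> int:
    """Toy attribute value: number of 'B' characters (stand-in for a real scorer)."""
    return seq.count("B")


def iterative_edit(
    seq: str,
    propose: Callable[[str], List[str]],
    scorer: Callable[[str], int],
    n_steps: int,
) -> str:
    """Greedily apply one local edit per step, stopping when no edit improves the score."""
    current = seq
    for _ in range(n_steps):
        candidates = propose(current)
        if not candidates:
            break
        best = max(candidates, key=scorer)
        if scorer(best) <= scorer(current):
            break  # no local edit improves the attribute value
        current = best
    return current


result = iterative_edit("AAAA", propose_edits, score, n_steps=3)
print(result, score(result))
```

Each step changes only one position, so the attribute value rises by a small increment per iteration. In the setting described above, a trained editing model would take the place of the exhaustive `propose_edits` enumeration.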