Cross-Domain Evaluation of a Deep Learning-Based Type Inference System

Optional type annotations enrich dynamic programming languages with the benefits of static typing, such as better Integrated Development Environment (IDE) support, more precise program analysis, and early detection and prevention of type-related runtime errors. Machine learning-based type inference shows promise for automating the annotation task. However, the practical usefulness of such systems depends on their ability to generalize across domains, because they are often applied outside their training domain. In this work, we investigate Type4Py as a representative of state-of-the-art deep learning-based type inference systems by conducting extensive cross-domain experiments. In doing so, we address the following problems: class imbalance, out-of-vocabulary words, dataset shift, and unknown classes. For these experiments, we use the datasets ManyTypes4Py and CrossDomainTypes4Py, the latter of which we introduce in this paper. Our dataset enables the evaluation of type inference systems across different domains of software projects and contains over 1,000,000 type annotations mined from the platforms GitHub and Libraries. It consists of data from two domains: web development and scientific calculation. Our experiments show that dataset shift and the long-tailed distribution, with its many rare and unknown data types, drastically decrease the performance of the deep learning-based type inference system. To overcome these issues, we test unsupervised domain adaptation methods and fine-tuning. Moreover, we investigate the impact of out-of-vocabulary words.
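To make the notions of unknown classes and long-tailed type distributions concrete, the following is a minimal Python sketch. It is not the evaluation procedure of Type4Py or of the datasets named above; the function names `unknown_type_rate` and `rare_type_share` and the toy annotation lists are hypothetical and chosen only for illustration.

```python
from collections import Counter


def unknown_type_rate(source_types, target_types):
    """Share of target-domain annotations whose type never occurs in the source domain."""
    if not target_types:
        return 0.0
    known = set(source_types)
    unseen = sum(1 for t in target_types if t not in known)
    return unseen / len(target_types)


def rare_type_share(types, min_count=2):
    """Share of annotations whose type occurs fewer than min_count times (the long tail)."""
    if not types:
        return 0.0
    counts = Counter(types)
    rare = sum(c for c in counts.values() if c < min_count)
    return rare / len(types)


# Made-up toy annotations standing in for two domains (assumption, not real data).
web_types = ["str", "int", "str", "HttpRequest", "dict", "str"]
sci_types = ["int", "ndarray", "float", "ndarray", "str"]

# Fraction of scientific-domain annotations unseen in the web domain.
print(unknown_type_rate(web_types, sci_types))
# Fraction of web-domain annotations belonging to types seen only once.
print(rare_type_share(web_types))
```

In this toy setting, a model trained only on the web-domain list could never predict `ndarray` or `float` as a closed-set classifier, which is the kind of cross-domain gap the experiments are designed to measure.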
