Floating-point inconsistencies across compilers can undermine the reliability of numerical software. We present LLM4FP, the first framework that uses Large Language Models (LLMs) to generate floating-point programs specifically designed to trigger such inconsistencies. LLM4FP combines Grammar-Based Generation and Feedback-Based Mutation to produce diverse and valid programs. We evaluate LLM4FP across multiple compilers and optimization levels, measuring inconsistency rate, time cost, and program diversity. LLM4FP detects nearly 2.5x as many inconsistencies as the state-of-the-art tool Varity. Notably, most of these inconsistencies are differences between ordinary real values rather than extreme values such as NaN or infinity. LLM4FP also uncovers inconsistencies across a wider range of optimization levels, and finds that mismatches are most frequent between host and device compilers. These results show that LLM-guided program generation improves the detection of numerical inconsistencies. In practice, numerical software and HPC developers can use LLM4FP to compare compilers and select those that provide more accurate and consistent floating-point behavior, while compiler developers can use it to identify and address subtle consistency issues in their implementations.
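To make the distinction between real-valued and extreme-value inconsistencies concrete, the following is a minimal sketch and not part of LLM4FP itself. It assumes that two compilers, or two optimization levels, evaluate the same sum in different association orders, which value-changing optimizations may do. The two Python functions stand in for the two compiled binaries, and all names (`classify`, `sum_left_to_right`, `sum_reassociated`) are hypothetical.

```python
import math


def classify(x: float, y: float) -> str:
    """Label a pair of results from two builds of the same program."""
    # Two NaNs never compare equal, but here they count as agreeing.
    if math.isnan(x) and math.isnan(y):
        return "consistent"
    if x == y:
        return "consistent"
    # A mismatch where either side is NaN or infinite is an extreme-value case.
    if any(math.isnan(v) or math.isinf(v) for v in (x, y)):
        return "extreme-value difference"
    return "real-valued difference"


def sum_left_to_right(a: float, b: float, c: float) -> float:
    # Stand-in for a build that keeps the source order: (a + b) + c.
    return (a + b) + c


def sum_reassociated(a: float, b: float, c: float) -> float:
    # Stand-in for a build that reassociates the sum: a + (b + c).
    return a + (b + c)


# Ordinary inputs: the two orders round differently in the last bit.
small = (0.1, 0.2, 0.3)
kind_small = classify(sum_left_to_right(*small), sum_reassociated(*small))

# Inputs near the double-precision limit: one order overflows to infinity.
large = (1e308, 1e308, -1e308)
kind_large = classify(sum_left_to_right(*large), sum_reassociated(*large))

print(kind_small)
print(kind_large)
```

Under these assumptions, the first pair of results differs only in the last bit, while the second differs because one order overflows. A differential-testing harness would apply the same kind of comparison to the outputs of real compiled binaries rather than to two Python functions.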