The growing complexity of modern software systems has exposed the shortcomings of traditional program analysis techniques, particularly for Software Engineering (SE) tasks. Machine learning and Large Language Models (LLMs) offer promising solutions, but their effectiveness is limited by the way they interpret their input. Unlike natural language, the meaning of source code is defined less by token adjacency and more by complex, long-range structural relationships and dependencies. This limitation is especially pronounced for C and C++, where flatter syntactic hierarchies, pointer aliasing, multi-level indirection, typedef-based type obfuscation, and calls through function pointers hinder accurate static analysis. To address these challenges, this paper introduces ATLAS, a Python-based Command-Line Interface (CLI) tool that (i) generates statement-level Control Flow Graphs (CFGs) and type-aware Data Flow Graphs (DFGs) that capture inter-function dependencies across the entire program; (ii) handles entire C and C++ projects comprising multiple files; (iii) works on both compilable and non-compilable code; and (iv) produces a unified multi-view code representation that combines Abstract Syntax Trees (ASTs), CFGs, and DFGs. By preserving essential structural and semantic information, ATLAS provides a practical foundation for improving downstream SE tasks and machine-learning-based program understanding.
Video demonstration: https://youtu.be/RACWQe5ELwY
Tool repository: https://github.com/jaid-monwar/ATLAS-code-representation-tool
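To make the idea of a unified multi-view representation concrete, the following is a minimal sketch of how AST, CFG, and DFG edges over the same statements might be merged into one edge list. It is not ATLAS's implementation or API: the names `Stmt` and `build_multiview` are hypothetical, the def/use sets are annotated by hand rather than derived by a parser, and only straight-line code is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Stmt:
    """One statement with hand-annotated defined and used variables."""
    sid: int
    text: str
    defs: set = field(default_factory=set)
    uses: set = field(default_factory=set)


def build_multiview(func_name, stmts):
    """Return (src, dst, view) edges for a straight-line function body."""
    edges = []
    # AST view: the function node is the parent of every statement.
    for s in stmts:
        edges.append((func_name, s.sid, "AST"))
    # CFG view: without branches, control flows to the next statement.
    for a, b in zip(stmts, stmts[1:]):
        edges.append((a.sid, b.sid, "CFG"))
    # DFG view: connect the most recent definition of a variable to each use.
    last_def = {}
    for s in stmts:
        for var in sorted(s.uses):
            if var in last_def:
                edges.append((last_def[var], s.sid, "DFG"))
        for var in s.defs:
            last_def[var] = s.sid
    return edges


# Hypothetical C body: int a = 1; int *p = &a; *p = 2; return a;
# Statement 3 writes `a` through the alias `p`, so it is annotated as
# defining `a`. This annotation is an assumption made by hand here.
stmts = [
    Stmt(1, "int a = 1;", defs={"a"}),
    Stmt(2, "int *p = &a;", defs={"p"}),
    Stmt(3, "*p = 2;", defs={"a"}, uses={"p"}),
    Stmt(4, "return a;", uses={"a"}),
]
edges = build_multiview("main", stmts)
for edge in edges:
    print(edge)
```

In this sketch the data-flow edge into `return a;` comes from statement 3 rather than statement 1, which is only correct because the write through `p` was recorded as a definition of `a`. A token-level view of the code carries no such link, which illustrates why pointer aliasing complicates analysis of C and C++.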