The automation of user interface development has the potential to accelerate software delivery by reducing labor-intensive manual implementation. Despite advances in Large Multimodal Models for design-to-code translation, existing methodologies predominantly yield unstructured, flat codebases that lack compatibility with component-oriented libraries such as React or Angular. Such outputs typically exhibit low cohesion and high coupling, complicating long-term maintenance. In this paper, we propose \textbf{VSA (Visual-Structural Alignment)}, a multi-stage paradigm designed to synthesize organized frontend assets through visual-structural alignment. Our approach first employs a spatial-aware transformer to reconstruct the visual input into a hierarchical tree representation. Moving beyond basic layout extraction, we integrate an algorithmic pattern-matching layer to identify recurring UI motifs and encapsulate them into modular templates. These templates are then processed by a schema-driven synthesis engine, ensuring the Large Language Model generates type-safe, prop-driven components suitable for production environments. Experimental results indicate that our framework yields a substantial improvement in code modularity and architectural consistency over state-of-the-art baselines, effectively bridging the gap between raw pixels and scalable software engineering.
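To make the intermediate representations concrete, the sketch below illustrates in TypeScript what a hierarchical UI tree and a reusable component template might look like, together with a naive structural-fingerprint pass for spotting recurring motifs. This is an illustrative assumption only: the type names \texttt{UINode} and \texttt{ComponentTemplate} and the function \texttt{findRepeatedMotifs} are hypothetical and do not reflect the paper's actual implementation.

\begin{verbatim}
// Hypothetical data structures for the intermediate representation
// described above; names and fields are illustrative only.

// A node in the hierarchical tree reconstructed from the visual input.
interface UINode {
  tag: string;                      // e.g. "div", "button", "img"
  bounds: { x: number; y: number; width: number; height: number };
  children: UINode[];
}

// A recurring UI motif encapsulated as a reusable, typed template.
interface ComponentTemplate {
  name: string;                     // e.g. "ProductCard"
  props: Record<string, "string" | "number" | "boolean">;
  structure: UINode;                // canonical subtree the template abstracts
}

// A naive structural fingerprint: subtrees sharing a fingerprint are
// candidates for extraction into a shared component template.
function fingerprint(node: UINode): string {
  return `${node.tag}(${node.children.map(fingerprint).join(",")})`;
}

// Group subtrees by fingerprint to surface repeated motifs.
function findRepeatedMotifs(root: UINode): Map<string, UINode[]> {
  const groups = new Map<string, UINode[]>();
  const walk = (node: UINode): void => {
    const key = fingerprint(node);
    const bucket = groups.get(key) ?? [];
    bucket.push(node);
    groups.set(key, bucket);
    node.children.forEach(walk);
  };
  walk(root);
  // Keep only fingerprints that occur more than once.
  return new Map([...groups].filter(([, nodes]) => nodes.length > 1));
}
\end{verbatim}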