Large language models (LLMs) have recently demonstrated strong capabilities in generating functional and aesthetic web interfaces directly from instructions. However, these models often replicate accessibility flaws from their training data, resulting in interfaces that exclude users with diverse needs and contexts. To address this gap, we introduce A11yn, the first method that aligns code-generating LLMs to reliably produce accessibility-compliant web UIs. A11yn optimizes a novel reward function that penalizes violations of the Web Content Accessibility Guidelines (WCAG), with penalties scaled to the severity of each violation as identified by an accessibility testing engine. To support training, we construct UIReq-6.8K, a dataset of 6,800 diverse instructions for web UI generation. For evaluation, we introduce RealUIReq-300, a benchmark of 300 real-world web UI requests grounded in and manually curated from public web pages, spanning a broad range of use cases. Empirical results show that A11yn significantly outperforms strong baselines, lowering the Inaccessibility Rate by 60% relative to the base model while preserving the semantic fidelity and visual quality of the generated UIs. These findings demonstrate that accessibility can be systematically optimized within LLMs, establishing the feasibility of aligning code generation for accessibility.
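The severity-scaled reward described above can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual formulation: the weight values, the `impact` field names (following the axe-core convention of minor/moderate/serious/critical impact levels), and the function name are all hypothetical.

```python
# Hedged sketch of a severity-weighted accessibility reward.
# Assumes an accessibility testing engine reports each WCAG violation
# with an axe-core-style "impact" level; all weights are illustrative.

SEVERITY_WEIGHTS = {
    "minor": 1.0,
    "moderate": 2.0,
    "serious": 4.0,
    "critical": 8.0,
}

def accessibility_reward(violations, max_penalty=100.0):
    """Map a list of detected WCAG violations to a reward in [0, 1].

    Each violation is a dict with an "impact" key; more severe
    violations subtract a larger penalty, so a fully compliant page
    (no violations) receives the maximum reward of 1.0.
    """
    penalty = sum(
        SEVERITY_WEIGHTS.get(v.get("impact", "minor"), 1.0)
        for v in violations
    )
    return max(0.0, 1.0 - penalty / max_penalty)
```

Under this sketch, a page with no violations gets reward 1.0, while each critical violation costs eight times as much as a minor one, steering the policy away from the most exclusionary defects first.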