Text-attributed graphs require models to integrate structural topology with semantic content. Recent approaches apply large language models to graphs by linearizing structures into token sequences via random walks, creating concise graph vocabularies that replace verbose natural language descriptions. However, they overlook a critical component of what makes language expressive: grammar. In natural language, grammar assigns syntactic roles to words and defines their functions within sentences. Similarly, nodes in graphs play distinct structural roles as hubs, bridges, or peripheral members. Current graph language methods provide tokens without grammatical annotations that indicate these structural or semantic roles, which limits language models' ability to reason about graph topology. We propose \textbf{G2rammar}, a bilingual grammar framework that explicitly encodes both structural and semantic grammar for text-attributed graphs. Structural grammar characterizes topological roles through centrality and neighborhood patterns; semantic grammar captures content relationships through textual informativity. The framework adopts two-stage learning: structural grammar pre-training followed by semantic grammar fine-tuning. Extensive experiments on real-world datasets demonstrate that G2rammar consistently outperforms competitive baselines by giving language models the grammatical context needed to understand graph structures.
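To make the notion of structural grammar concrete, the following minimal sketch tags each node with a coarse topological role (hub, bridge, or peripheral) derived from degree and betweenness centrality. The thresholds, role vocabulary, and use of networkx are illustrative assumptions for exposition only, not G2rammar's actual grammar encoding.

\begin{verbatim}
# Illustrative sketch: coarse structural-role tags from centrality.
# Thresholds and role names are assumptions, not the paper's method.
import networkx as nx

def structural_roles(G, hub_quantile=0.9, bridge_quantile=0.9):
    """Assign each node a coarse topological role tag."""
    deg = nx.degree_centrality(G)
    btw = nx.betweenness_centrality(G)

    deg_cut = sorted(deg.values())[int(hub_quantile * (len(deg) - 1))]
    btw_cut = sorted(btw.values())[int(bridge_quantile * (len(btw) - 1))]

    roles = {}
    for v in G.nodes():
        if deg[v] >= deg_cut:
            roles[v] = "hub"          # high-degree node
        elif btw[v] >= btw_cut:
            roles[v] = "bridge"       # lies on many shortest paths
        else:
            roles[v] = "peripheral"   # low structural prominence
    return roles

G = nx.karate_club_graph()
print(structural_roles(G))
\end{verbatim}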