Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.
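The abstract names its two key mechanisms only at a high level. As a reading aid, the following is a minimal, non-authoritative sketch of what an "information-preserved stretching interpolation" over a text encoder's positional embeddings could look like, assuming it keeps a prefix of well-trained positions intact and linearly interpolates the remaining positions to a longer context length so that long LLM-generated captions fit the encoder. The function name, embedding dimensions, and the `keep_first` split are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_embed: torch.Tensor,
                                  target_len: int,
                                  keep_first: int = 20) -> torch.Tensor:
    """Stretch learned positional embeddings (orig_len, dim) to target_len.

    The first `keep_first` positions are copied verbatim (the "information-
    preserved" part, assuming early positions carry the best-trained
    semantics); the rest are linearly interpolated to fill the longer context.
    """
    orig_len, dim = pos_embed.shape
    kept = pos_embed[:keep_first]                    # preserved as-is
    rest = pos_embed[keep_first:].T.unsqueeze(0)     # (1, dim, orig_len - keep_first)
    stretched = F.interpolate(rest,
                              size=target_len - keep_first,
                              mode="linear",
                              align_corners=True)
    stretched = stretched.squeeze(0).T               # (target_len - keep_first, dim)
    return torch.cat([kept, stretched], dim=0)       # (target_len, dim)

# e.g. extend a CLIP-like 77-token context to 248 tokens
pos = torch.randn(77, 512)
long_pos = stretch_positional_embeddings(pos, target_len=248)
print(long_pos.shape)  # torch.Size([248, 512])
```

The model-level component can be sketched in the same hedged spirit, under the assumption that the momentum-based self-distillation keeps an EMA copy of the image and text encoders whose soft similarity distributions serve as stable pseudo-targets, mixed with the noisy one-hot image-caption targets in a contrastive loss. The class name, hyperparameters (m, alpha, tau), and loss form below are hypothetical, in the style of momentum distillation, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn.functional as F

class MomentumDistiller(torch.nn.Module):
    """Sketch of momentum-based self-distillation for noisy caption supervision."""

    def __init__(self, image_encoder, text_encoder, m=0.995, alpha=0.4, tau=0.07):
        super().__init__()
        self.img_enc, self.txt_enc = image_encoder, text_encoder
        # frozen EMA (momentum) copies of the student encoders
        self.img_enc_m = copy.deepcopy(image_encoder)
        self.txt_enc_m = copy.deepcopy(text_encoder)
        for p in list(self.img_enc_m.parameters()) + list(self.txt_enc_m.parameters()):
            p.requires_grad_(False)
        self.m, self.alpha, self.tau = m, alpha, tau

    @torch.no_grad()
    def _ema_update(self):
        for s, t in [(self.img_enc, self.img_enc_m), (self.txt_enc, self.txt_enc_m)]:
            for ps, pt in zip(s.parameters(), t.parameters()):
                pt.data.mul_(self.m).add_(ps.data, alpha=1.0 - self.m)

    def forward(self, images, captions):
        v = F.normalize(self.img_enc(images), dim=-1)     # (B, d) image features
        t = F.normalize(self.txt_enc(captions), dim=-1)   # (B, d) caption features
        with torch.no_grad():
            self._ema_update()
            v_m = F.normalize(self.img_enc_m(images), dim=-1)
            t_m = F.normalize(self.txt_enc_m(captions), dim=-1)
            # soft pseudo-targets from the momentum encoders
            soft_i2t = F.softmax(v_m @ t_m.T / self.tau, dim=-1)
            soft_t2i = F.softmax(t_m @ v_m.T / self.tau, dim=-1)
        hard = torch.eye(v.size(0), device=v.device)      # noisy one-hot targets
        tgt_i2t = self.alpha * soft_i2t + (1 - self.alpha) * hard
        tgt_t2i = self.alpha * soft_t2i + (1 - self.alpha) * hard
        loss_i2t = -(tgt_i2t * F.log_softmax(v @ t.T / self.tau, dim=-1)).sum(1).mean()
        loss_t2i = -(tgt_t2i * F.log_softmax(t @ v.T / self.tau, dim=-1)).sum(1).mean()
        return loss_i2t + loss_t2i
```

The mixing weight `alpha` controls how much the stable pseudo-targets override the possibly noisy one-hot caption labels; in this sketch it is fixed, though a schedule over training is equally plausible.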