In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline used to build them. The trajectories are public GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic but realistic social media posts generated by Large Language Models (LLMs), which enables multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two large, structurally distinct cities: Paris and New York. Our open-source, reproducible pipeline allows the datasets to be customized, and the datasets themselves support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, this resource is the first to combine real-world movement data, structured semantic enrichment, LLM-generated text, and Semantic Web compatibility in a reusable framework.