Web search is an essential way for humans to obtain information, but understanding the contents of web pages remains a great challenge for machines. In this paper, we introduce the task of web-based structural reading comprehension: given a web page and a question about it, the task is to find the answer on the web page. This task requires a system to understand not only the semantics of the text but also the structure of the web page. Moreover, we propose WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 0.44M question-answer pairs collected from 6.5K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various strong baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and task are publicly available at https://speechlab-sjtu.github.io/WebSRC/.
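To make the task format concrete, the following is a minimal sketch of how one question-answer pair might be represented and checked against the stated answer constraint (a text span on the page, or yes/no). The `Example` class, its field names, and the sample HTML are hypothetical illustrations and are not the actual WebSRC file format or an official API.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


@dataclass
class Example:
    # Hypothetical field names, chosen for illustration only.
    html: str
    question: str
    answer: str


def page_text(html: str) -> str:
    """Return the page's text nodes joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_valid_answer(example: Example) -> bool:
    """An answer is valid if it is yes/no or a text span found on the page."""
    if example.answer.lower() in {"yes", "no"}:
        return True
    return example.answer in page_text(example.html)


# Made-up sample page: the answer depends on which table cell
# belongs to which row and column, not on the text alone.
example = Example(
    html=(
        "<table><tr><th>Model</th><th>Price</th></tr>"
        "<tr><td>A1</td><td>$299</td></tr></table>"
    ),
    question="What is the price of model A1?",
    answer="$299",
)
```

Flattening the page to plain text, as `page_text` does, discards the row and column relationships that link "A1" to "$299", which illustrates why the task calls for structural understanding in addition to text semantics.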