Inscriptis provides a library, command-line client, and Web service for converting HTML to plain text. Its development was motivated by the need in knowledge extraction tasks for accurate text representations that preserve the spatial alignment of text without relying on heavyweight, browser-based solutions such as Selenium. In contrast to existing software packages such as HTML2text, jusText, and Lynx, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and therefore better preserves the spatial arrangement of text elements. Inscriptis excels in conversion quality because it correctly converts complex HTML constructs such as nested tables and interprets a subset of the HTML attributes that determine text alignment. In addition, it (ii) supports annotation rules, i.e., user-provided mappings that allow annotating the extracted text based on the structural and semantic information encoded in the HTML tags and attributes that control structure and layout in the original HTML document. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations and, if annotation support has been enabled, can also use information on the semantics and structure of the original HTML document.
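To make the idea of annotation rules concrete, the following is a minimal, standard-library-only sketch of mapping HTML tags to labels and recording the character spans they cover in the extracted text. It is not Inscriptis's implementation or API: the class `AnnotatingTextExtractor`, the helper `annotate`, the rule format, and the `(start, end, label)` output are all hypothetical choices made for illustration, and the sketch ignores tables, attributes, and most layout handling.

```python
import re
from html.parser import HTMLParser


class AnnotatingTextExtractor(HTMLParser):
    """Toy HTML-to-text converter that labels text spans by tag name.

    Hypothetical illustration only; not how Inscriptis is implemented.
    """

    # Tags that start and end on their own line in this simplified model.
    BLOCK_TAGS = {"h1", "h2", "p", "div", "li"}

    def __init__(self, rules):
        super().__init__()
        self.rules = rules      # assumed format: {"h1": ["heading"], ...}
        self.parts = []         # text fragments emitted so far
        self.length = 0         # number of characters emitted so far
        self.open_spans = []    # stack of (tag, start_offset)
        self.labels = []        # finished (start, end, label) triples

    def _emit(self, text):
        self.parts.append(text)
        self.length += len(text)

    def _at_line_start(self):
        return not self.parts or self.parts[-1].endswith("\n")

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS and not self._at_line_start():
            self._emit("\n")
        if tag in self.rules:
            # Remember where the annotated span begins in the output text.
            self.open_spans.append((tag, self.length))

    def handle_endtag(self, tag):
        if self.open_spans and self.open_spans[-1][0] == tag:
            _, start = self.open_spans.pop()
            for label in self.rules[tag]:
                self.labels.append((start, self.length, label))
        if tag in self.BLOCK_TAGS and not self._at_line_start():
            self._emit("\n")

    def handle_data(self, data):
        # Collapse whitespace runs; drop leading whitespace at line start.
        text = re.sub(r"\s+", " ", data)
        if self._at_line_start():
            text = text.lstrip()
        if text:
            self._emit(text)


def annotate(html, rules):
    """Return (plain_text, labels) for the given HTML and rule mapping."""
    parser = AnnotatingTextExtractor(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.labels


html = "<h1>Chur</h1><p>Chur is the <b>capital</b> of Grisons.</p>"
rules = {"h1": ["heading"], "b": ["emphasis"]}
text, labels = annotate(html, rules)
for start, end, label in labels:
    print(label, repr(text[start:end]))
```

Because the offsets index directly into the extracted text, a downstream component can recover each labelled span by slicing, which is the property that makes this kind of annotation useful for knowledge extraction.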