Lexers and parsers are typically defined separately and connected by a token stream. This separation is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on them come at a cost. We show how to fuse separately defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple. We also prove that normalizing context-free expressions into this deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.
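To make the cost of materialized tokens concrete, the following is a minimal hand-written sketch in Python. It is not flap's API or its generated code: flap is an OCaml library that derives fused code automatically by staging and normalization. The toy grammar (nested parenthesized lists of alphabetic atoms) and the names `lex`, `parse_tokens`, and `parse_fused` are all assumptions made for illustration. The first parser consumes a token list produced by a separate lexer, while the second branches directly on input characters and never allocates a token.

```python
from typing import Iterator, List, Tuple

# Hypothetical toy grammar:  sexp ::= ATOM | "(" sexp* ")"
# The semantic result is the number of atoms in the input.

Token = Tuple[str, str]  # (kind, text)


def lex(s: str) -> Iterator[Token]:
    """Separate lexer: materializes one tuple per token."""
    i = 0
    while i < len(s):
        c = s[i]
        if c.isspace():
            i += 1
        elif c == "(":
            yield ("LPAR", c)
            i += 1
        elif c == ")":
            yield ("RPAR", c)
            i += 1
        elif c.isalpha():
            j = i
            while j < len(s) and s[j].isalpha():
                j += 1
            yield ("ATOM", s[i:j])
            i = j
        else:
            raise SyntaxError(f"unexpected character {c!r} at {i}")


def parse_tokens(tokens: List[Token]) -> int:
    """Separate parser: case-switches on token kinds."""
    def sexp(pos: int) -> Tuple[int, int]:
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind = tokens[pos][0]
        if kind == "ATOM":
            return 1, pos + 1
        if kind == "LPAR":
            pos, total = pos + 1, 0
            while pos < len(tokens) and tokens[pos][0] != "RPAR":
                n, pos = sexp(pos)
                total += n
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            return total, pos + 1
        raise SyntaxError(f"unexpected token {kind}")

    count, pos = sexp(0)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return count


def skip_ws(s: str, i: int) -> int:
    while i < len(s) and s[i].isspace():
        i += 1
    return i


def parse_fused(s: str) -> int:
    """Fused recognizer: branches on characters, builds no tokens."""
    def sexp(i: int) -> Tuple[int, int]:
        i = skip_ws(s, i)
        if i >= len(s):
            raise SyntaxError("unexpected end of input")
        c = s[i]
        if c.isalpha():  # ATOM, scanned in place
            while i < len(s) and s[i].isalpha():
                i += 1
            return 1, i
        if c == "(":
            i, total = skip_ws(s, i + 1), 0
            while i < len(s) and s[i] != ")":
                n, i = sexp(i)
                total += n
                i = skip_ws(s, i)
            if i >= len(s):
                raise SyntaxError("missing ')'")
            return total, i + 1
        raise SyntaxError(f"unexpected character {c!r} at {i}")

    count, i = sexp(0)
    if skip_ws(s, i) != len(s):
        raise SyntaxError("trailing input")
    return count


text = "(a (b c) d)"
print(parse_tokens(list(lex(text))), parse_fused(text))  # prints "4 4"
```

Both functions accept the same language and compute the same result. The fused version simply skips the intermediate token tuples and the dispatch on token kinds, which are the overheads the abstract attributes to a materialized token stream.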