Fyi, this is a design document for Temper, a programming language for high-assurance libraries that run inside anything.

Simpler, error-tolerant parsers for {C.like;} languages

Error tolerant parsing techniques for C-like languages to enable better tool integration, partial analysis, language extensibility, and meta-programming.

Keywords: combinators, homoiconicity, IDE, meta-programming, operator-precedence, parsing, quasiquotation

    Abstract

    Most developers are familiar with languages whorse surface syntax is "C-like": C, C++, C#, Go, Java, JavaScript, Perl, Rust, Scala, Swift, and TypeScript. But the complexity of parsing C-like languages makes it hard to achieve what Lisps take for granted.

    Lisps reuse a few syntactic constructs that transparently relate to commonly used datatypes. This makes it easy for programmer tools to deal with program fragments, for designers to extend the language with new special forms or macros, and for programs to manipulate programs as data.

    Lispers often group these benefits under the rubric of “homoiconicity”. That term has some problems, so this document adopts goals attributed to homoiconicity but does not claims that any resulting language is homoiconic. These goals include:

    Unfortunately large grammars make it hard for tools simpler than the compiler to accurately deal with program fragments, to degrade gracefully, or to deal with code as data. This limits a diverse tool ecosystems to languages with a large community that can afford the engineering effort.

    This document introduces a way for niche languages to use familar C-like syntax and hopefully still support a diverse tool ecosystem. Specifically it

    Covering Grammars

    Grammars with many, fine-grained rules can do a good job preventing nonsensical inputs from confusing later compiler stages, but typically produce either a parse tree, or error messages, not both.

    Statement::==for( Stuff ) { Stuff }
    /if( Stuff ) { Stuff }
    /while( Stuff ) { Stuff }
    /

    A large grammar like this could adapt to account for common developer errors and to deal with partial inputs, but this takes engineering effort per rule, and makes the grammar even larger.

    if ( x ) { fn(; }
    
       ┗━━━┛ ┃   ╹  ┃
             ┗━━━━━━┛
    

    A parser that degrades gracefully on inputs like the above, should recognize the kind of rules-of-thumb that programmers use to reason about broken inputs:

    The grammar above has a lot of repetition. In languages with C-like grammars, most of the flow control special forms, and some other special forms (like JavaScript's function(...) {...}) follow the convention:

    Statement::==keyword [ ( Stuff ) ] { Stuff }

    A cover grammar is a grammar that matches all strings matched by some “covered” grammars so that a later pass can disambiguate between them. For example, consider the JavaScript expression (a = null) => (a = null). In it, (a = null) appears twice, once as a formal parameter list with a default value, and once as an assignment. To avoid reparsing based on whether a => follows the close parenthesis, JavaScript engines often use a cover grammar for FormalParameterList and PrimaryExpression.

    We can formalize developers' intuitions about a language by crafting a few rules that cover the myriad rules in the full grammar. This lets a few error recovery strategies tuned to programmers' intuitions carry much weight.

    Each of the branches for Statement above looks like a function application followed by a block, but there are a some conventions which distinguish these from normal function calls.

    Additionally, certain keywords allow chaining these constructs, e.g. else.

    Most C-like flow control constructs can be covered by a single rule:

    FlowControl ::==Keyword ParenPartopt Tail;
    ParenPart ::==( TokenSoup );
    ParenPartopt::==ParentPart / ε;
    Tail::=={ TokenSoup } MoreFlowControl
    /;;
    MoreFlowControl ::==InfixKeyword FlowControl
    /ε;
    InfixKeyword ::==else if / else / catch / finally;

    Not all follow this convention but many do, and below we detail how to handle exceptions to the rule.

    Stages

    This document builds a parser in JavaScript so that it can run in your browser. You can see below that the parser operates in three stages:

    1. Input: Source Text

      f(x + y);
    2. Lexical Phase: Source Text → Tokens

      f ( x + y ) ;
    3. Parse Phase: Tokens → Parse Tree

      f ( x + y ) ;
    4. Final Phase: Parse Tree → Abstract Syntax Tree

      stmt
      expr
      call
      expr
      id f
      actuals
      expr
      binaryop
      expr
      id x
      +
      expr
      id y

    Before jumping into code, it's worth defining some terms. For the purposes of this document:

    Token
    A substring of a program fragment with some diagnostic metadata. Tokens may be malformed, e.g. an unclosed quoted string.
    Significant token
    A token that is not a comment or ignorable whitespace.
    Parse Tree
    Concrete Syntax Tree (CST)
    A tree such that leaves are tokens and a preorder enumeration of leaves yields the significant tokens from which it was parsed.
    Abstract Syntax Tree (AST)
    A tree such that each node is either a token, or is a collection of sub-trees and a tag that identifies a production in the language's abstract grammar.

    An AST need not contain all significant tokens, e.g. an implementation might drop parentheses used merely for grouping, or tokens like “if” which are implicit in the tag.

    An inner AST node may have the special error tag in which case its children are a soup of tokens to which later passes should not assign clear semantics.

    Lexer
    Partitions source text into a stream of tokens.
    Parser
    A consumer of a stream of tokens that produces a parse tree.
    Combinator
    A consumer of a stream that builds a structured output based on declarative rules describing patterns involving sequences of stream elements.

    Precedence

    Operator precedence parsers reuse a few simple rules. It's straightforward to craft such a parser to produce a parse tree for every input so they recover well from many kinds of syntax errors. Instead of having a grammar rule for each language construct, an operator precedence parser behaves like a grammar driven parser with just one rule per kind of operator:

    Operand ::==PrefixOperator Operand
    |Operand InfixOperator Operand
    |Operand TernaryOperator Operand TernaryFollower Operand
    |Operand OpenBracket Operand CloseBracket
    |Operand PostfixOperator
    |SimpleOperand;

    Operator precedence parsers use “operators” to figure out how to turn a series of tokens into a tree structure. This lets us infer parentheses in expressions and to identify when a keyword like else or the while in do...while... groups things to its immediate left and right. So PrefixOperator defines which operators can appear before their operands like the ‘-’ in “-x”; InfixOperator defines which can appear in the middle like the ‘+’ in “x+y”.

    An operator precedence parser still needs to decide how to group operands: for example, if InfixOperator::==+|×; then naïvely applying the grammar above to “a×b+c” would give two possible inerpretations: (a×b)+c and a×(b+c).

    Operator precedence parsers use an order over operators to avoid ambiguity. This order gives precedence to some operators and answers the question “Can this operator contain that operator unparenthesized?” For example, if ‘+’ can contain ‘×’ unparenthesized but not vice versa then valid interpretations of “a×b+c” include (a×b)+c but not a×(b+c).

    Here is an operator precedence table similar to those that appear in user-documentation for C, Java (normative), and JavaScript which gives a foundation for our precedence relation.

    We need to be able to match up brackets.

    Some derived tables simplify later code greatly.

    These operators give arithmetic & logical expressions theit usual meanings.

    Note: Herein, parse trees are shown with a rectangle around each inner node. Click the expando (▶) to see an indented view of the same. These diagrams are generated programatically by the described parser running as JavaScript in your browser.

    Less-than ambiguity

    There is a deep, unavoidable ambiguity in C++, Java, and probably Rust. A ‘<’ token may be either infix operator less-than or the start of an angle bracketed group of type parameters. Consider the partial token sequence … f ( a < b , c > d ) … in Java:

    public <b, c> int f(a<b, c> d) {
      //   ⇧    ⇧        ⇧    ⇧
      // Angle brackets group type parameters in a method signature.
    
      return d.f(a < b, c > d);
      //           ⇧      ⇧
      // Using less than, greater than to compute actual parameters.
    }
    

    This is a real problem and this document only provides a partial solution to this ambiguity. We take a cue from TypeScript which allows types (and hence type parameters) in a few readily identifiable contexts:

    Here's how things parse when our operator precedence function only allows <…> to the right of the operators listed above.

    Note: JSXElements like (image = <img src={src}>) do not require parse-time disambiguation. The code herein does not handle JSX, but later sections introduce mechanisms that should extend to that use case. Specifically, lexing explains how to use one token of lookbehind to decide whether ‘/’ starts a division operator or a regular expression literal. A similar approach could decide whether a ‘<’ starts an operator or a JSX block. Also, preparsing talks about a technique to group template literal parts which would help handle JSX.

    Dead-end: “Part-of-speech” tagging

    Dead-end: “Part-of-speech” tagging

    This approach was found to be brittle, and hard to build confidence in. It is included to document a dead-end.

    It may be possible to resolve ambiguities, like angle bracket ambiguity, would be to provide additional context cues to the parser.

    A part-of-speech tagger could run as a part of the pre-parse processing pass over tokens to add context usable by the operator precedence function. Parts of speech might include:

    This example shows two distinct uses of ‘<’: once to bracket type parameters, and later as a comparison operator. It also shows three distinct uses of <{>: first to start a statement block, then o bracket a record type, and finally to construct a record.

    An operator precedence function that used part-of-speech cues could assign different precedences to these different uses.

    Here's a tagger that seems to work well for TypeScript, but is brittle.

    This tagger is a fairly complicated push-down automaton, but languages for which it's possible to specify may be able to exploit that fact to simplify their precedence functions.

    Flow control joiners

    Besides that, expressions in C-like languages are a good fit for precedence tables like this. It turns out that statements can be shoe-horned in with a wrinkle though. As discussed, most flow control constructs in C-like languages have the form keyword (expression) statement.

    There are a few keywords that precede blocks but that always follow statements. And ‘;’ separates statements.

    Statement parsing

    With this, we're ready to parse some statements.

    do…while

    Except that in do{...}while(...); the while appears like an infix operator, but only directly inside a do which acts like a prefix operator.

    Flavours of colon

    Also, ‘:’ is odd. It seems to have widely distinct precedences:

    Some recent, C-like languages have less ‘:’-ambiguity. Kotlin, for example, uses ‘@’ in labeled statements and allows “if…else…” in expressions in lieu of a ternary operator.

    Infix curly brackets

    There's one more wrinkle. In keyword (expr) { stmt; }, the {}s act as an infix operator like the ()s. This is not what we want for the second pair of {}s in while (b) {x} {y}. In C-like languages, {y} is a block that executes separately from, and after the while construct. So we will disallow chaining of {} as an infix operator.

    Expression preceding keywords

    Defining prefix operators for keywords like return that precede bare expressions ensures we won't get a different parse tree depending on whether parentheses are present, and return-1 won't parse to infix ‘-’ as it should in x-1.

    An operator precedence function

    Knowing all this, we're ready to craft our answer to “Can this operator contain that one unparenthesized?”:

    This function takes >50 lines of code to handle legacy constructs and corner cases from a slew of languages to demonstrate that handling these is feasible. An equivalent function for a language that did not need to support all these corner cases might be much simpler. For example, using […] to group type parameters, a la Scala, would obviate half of this code.

    This function depends on some bookkeeping functions to juggle brackets.

    Separators

    This works pretty well but there are a few quirks worth mentionining.

    In f(actual0, actual1, actual2) there are nested infix commas (,) which will later require a bit of work in the grammar (see commaop:).

    But semicolons (‘;’) are postfix operators so there's no nesting in blocks, and in for (init;cond;incr).

    In for loops, parts are optional. “for (;;) {}” has a postfix semicolon operator applied to a leaf ‘;’ token which requires an extra branch in the for-loop rule.

    Multiple adjacent NOPs behave the same.

    Parsing

    Before defining the parser, we abstract out the set of operators. (This is just a wrapper for our operators array, but the abstraction will come in handy later when talking about user-defined operators a la Scala.)

    We can use the operator definitions and precedence relation to define a parser.

    Tests use these unless otherwise specified.

    Lexing

    We've defined a parser that operates on a stream of tokens, but have not yet defined the source of those tokens. This lexer is not interesting except to demonstrate strategies for dealing with irregularities and recursive lexical constructs without entangling the parser & lexer.

    Note: Readers may find it confusing when I use the term “context-free” (CF) to refer specifically to the lexical grammar instead of the language as a whole. Some C-like languages have simple lexical grammars but are not CF. E.g., in C++ the parser needs context about whether “x” is a type to decide whether x* y; is a declaration of a pointer variable or an application of operator *. This kind of complication does not mean that an input cannot be split into tokens with a CF lexical grammar, even though you may need to do some tricks when “>>” follows type parameters.) See also Trevor Jim.

    C-like languages' lexical grammars need not be regular, meaning there need not exist a single regular expression that partitions the input into tokens. JavaScript and Perl give some ideas of where common non-regularities arise; neither have either regular or context-free lexical grammars; both require scannerless parsing.

    It is possible to restructure the parser above to be scannerless; where the parser above asks “is the current token an infix operator?”, a scannerless parser would ask “which, if any, is the longest infix operator that is a prefix of the remaining input and can nest here?”.

    We do not do that though, because non-CF lexical grammars defeat our goal of allowing simple analyzers to correctly segment fragments of code that start and end on token boundaries.

    So CF grammars meet our goals, but regular lexical grammars are insufficient for commonly used, C-like languages. Complications arise when deciding whether ‘/’ starts a division operator (like / or /=) or a /regular-expression-literal/.

    Below is a lexer that recognizes /regular-expression-literals/ and JavaScript-esque `template ${ literals }`. It is CF but non-regular in two ways:

    Our lexer works as expected on some sample inputs. These tests show the series of tokens produced, with comment and whitespace tokens filtered out.

    Before lexing

    Before lexing, a source file needs to be converted from a stream of bytes to a stream of larger code-units. Some recent languages have mandated UTF-8 as the way to do this.

    For example, the Go specification says:

    Source code is Unicode text encoded in UTF-8.

    Language-mandated encodings mean that there's no risk of different tools assuming different encodings for the same source files, and no inconsistently supplied command line flag hints.

    Some languages do more before lexing happens. Java, and JavaScript, for example, both allow \uABCD to appear in both string literals and in identifier names. Before UTF-8 had obviously won, this made it straightforward to convert a program to 7-bit ASCII for easy interchange; escape any codepoint ≥ 128 using “\u” syntax and if there're an odd number of ‘\’ preceding, remove one. (This no longer works for JavaScript because the raw content of tagged template literals is observable.)

    // These two are equvalent in both Java and JavaScript.
    π      = 3.14159D;
    \u03c0 = 3.14159D;
    

    Java, unlike JavaScript, does this before the lexer has a chance to match quotes though.

    char doubleQuoteChar = '\u0022';
    doubleQuoteString = "\u0022"; // Syntax error because equivalent to:
    doubleQuoteString = """;
    

    Encodable string delimiters violate developers' intuitions about the boundaries between programming language tokens and tokens in the embedded language that describes the content of the string values in question.

    This oddity of Java can be used to hide code from static analyzers and to make silly programs that work in multiple languages, but adds no expressive power or clarity to Java.

    // \u000a       public class C {
    // \u000a         static void log(String s) { System.out.println(s); }
    // \u000a         public static void main(String[] argv) {
    /* \u002a/// */     let {log} = console;
                        log(
                          "Java"
    /* \u002a/// */       + "Script"
                        );
    // \u000a       } }
    
    // This confuses tools.
    // E.g., in Emacs's Java-mode, only the JavaScript for this program
    // shows as non-comment text.
    

    Most C and C++ programmers go their entire career without ever intentionally using digraphs or trigraphs which are expanded by the C preprocessor.

    // These two were equivalent until recent versions of C++.
      { x =   ~y;   }
    ??< x = ??-y; ??>
    
    // A source of hard-to-diagnose bugs when '??' appears in a string literal.
    // What does this do?
    cout << "??/"; // ";
    

    Complex, language-specific transformations before lexing are sources of hard-to-diagnose bugs when tools, code generators, or programmers forget about them. For languages that don't have legacy EBCDIC code, I've yet to see a case where complex decoding logic could not be delayed until after processing high-level structure like matching of quotes and brackets.

    Token preparsing

    Unfortunately, our parser can't deal with complex constructs like `staticText ${ dynamicValue } etc.` that span multiple tokens.

    We could build knowledge of this into the parser, but that would defeat the goal of having few, simple rules, and moving most of the understanding into the language-specific parse tree → AST transform.

    Instead, we transform the token stream between the lexer and the parser by adding synthetic tokens. This keeps knowledge about complex lexical constructs in the pre-parse phases.

    Before `staticText ${ dynamicValue } etc.`
    After (`staticText ${ (dynamicValue)} etc.`)

    The added parentheses have the effect of ensuring that the whole string is grouped as one construct, and that expressions inside holes (${...}) are self-contained. Open parenthesis is overloaded in most C-like languages to mean (grouping, function call, type cast) so we take care that the synthesized parentheses around holes are not interpreted as INFIX operations.

    We can see that this parses sensibly, even when there's an errant ‘;’ in an expression that fills a hole, and that the parentheses around holes do not participate in function application.

    Automatic Semicolon Insertion (ASI)

    Some C-like languages automatically insert semicolons.

    let x = 1    /* ; */
    let y = 2    // Not here
            + x  /* ; */
    

    This may make writing code easier, requiring fewer keystrokes and getting rid of common error messages about missing semicolons. ASI may be a net positive for authors, but in all languages that do it is a source of uncommon, but subtle bugs. Whether it is better for readers is unclear; does careful code review require correctly predicting where semicolons are inserted.

    This section discusses different varieties of semicolon insertion and how they affect operator precedence parsing. Later, suggestions provides concrete advice. The parser introduced above does not do semicolon insertion.

    Syntactic ASI

    ASI can be deeply entangled with parsing:

    This operator precedence scheme allows two non-operator tokens to appear adjacent: int x. An attempt to build ASI into an operator precedence parser would need to design the language to avoid adjacent words or explicitly handle this via some kind of adjacency operator, so that the parser could elect to insert a semicolon when the current token would require popping the stack past semicolon's precedence level, and a postfix semicolon could accept the lowest precedence stack element that has precedence greater than semicolon.

    Lexical ASI

    Alternatively, ASI can be based on lexical analysis before parsing starts, which would allow it to be done during token preparsing which would allow ASI without entangling the parser:

    Effect of ASI on Tooling

    Simple operations like moving tokens from one line to another to split overly long lines becomes more complicated when you need to ensure that the compiler won't differently insert semicolons. For example, in JavaScript ASI can cause two similar looking programs to be interpreted quite differently (see also Ecma262 ¶ 11.9.2):

    for (let i = 0; i < 10; i++) {
      f(i);
    }
    //        ↓
    for (let i = 0; i < 10; i
         ++) {
      f(i);
    }
    // SyntaxError.  Per JavaScript's ASI rules `++` cannot be separated
    // from its operand by a line break.
    
    
    longName = i++
    f()
    //        ↓
    longName = i
    ++
    f()
    // Loud semantic change.  Second equivalent to `longName=i;++(f());`
    
    
    function f(longName) {
      return longName()
    }
    //        ↓
    function f(longName) {
      return
        longName()
    }
    // Silent semantic change.  Call never happens.  Returns undefined
    

    With ASI, a developer using a generic editor function like Emacs's fill-paragraph (M-q) without a JavaScript-specific mode hook runs a risk of introducing subtle semantic changes to code. This is subideal.

    Tool-based ASI

    Another option (unimplemented in any language to the best of the authors' knowledge) is to do ASI but not in a way that's invisible to authors and readers.

    Much of the debate about ASI can be boiled down to two claims which seem mutually contradictory but are actually reconcilable:

    Language designers could specify how ASI should be done, commit to taking ASI into account when considering backwards compatibility, but leave the actual insertion to developer tools.

    To demonstrate this, here's a simple lexical semicolon inserter.

    Custom Expression Operators

    Scala shows one way of providing for user-defined infix expression operators.

    Any method with a single parameter can be used as an infix operator.

    When an expression uses multiple operators, the operators are evaluated based on the priority of the first character: …

    It's unsurprising that operation precedence parsers can handle user defined operators with different precedences.

    Now, user code can use an operator not defined above like ‘<=>’ (trinary comparison in Perl) with precedence similar to ‘<’:

    This requires some changed to lexing since our stock lexer splits ‘<=>’ into two tokens: ‘<=’ and ‘>’. The testcase above uses a lexer that simply splits on word breaks and spaces.

    One downside is that custom operators require developers to use more whitespace even in code that does not use custom operators. This limitation arises because there is no closed set of punctuation strings that lets us split, for example “x*-y” into “x * -y” because “*-” could be a user defined operator. C already has this problem to a small degree because “x--y” is not the same as “x - -y”, but developers typically write “x+y” instead of subtracting a negated value, so the fact that ‘++’ and ‘--’ are unsplittable by context-free lexers does not, in practice, confuse C authors.

    Well-formedness

    These tests show some common ways a parse tree may fail some generic well-formedness checks, and in the process catalogue some common kinds of problematic inputs. Red wiggly underlines show tokens that participate in a malformed subtree. Click the expando to see the error messages.

    Deriving ASTs via Combinators

    Many tools properly operate on parse trees, but some, chiefly compilers and interpreters, do better with representations that lack extraneous details, and are more directly related to language specification abstractions or elements of the output. “Compilers: Principles, Techniques, and Tools” by Aho, Sethi, Ullman contrasts parse trees with a more abstract tree representation:

    Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.

    Parser combinators are a fine way to turn a source text into a parse tree, but they can also be used to derive an abstract syntax tree (AST) from a parse tree.

    These tests show ASTs. Each inner AST node has a green border, and the node type in small, green print overlapping the top-left of its border. Leaf nodes are black text, have no type, and always correspond to tokens in the input.

    And this doesn't prevent containing errors:

    Combinators operate on a sequence of tokens, so, first, we need to transform our parse tree into a sequence.

    Flattening a parse tree
    Inputletx=1;
    Tokens"let""x""=""1"";"
    Parse Tree[["let","x"],"=", ["1"],";"]
    Flat Tree"let""x""=""1"";"

    As you can see, a flattened parse tree:

    The flatten function takes a parse tree and produces a sequence with indent and dedent pseudo-tokens around the content of each inner node.

    Because the flattened tree contains explicit ⇨ and ⇦ it can be consumed by simple combinators that make no special provisions for either Left- or Right- recursion.

    There's nothing novel about the combinator definitions, so if you already know how combinator libraries tend to work, you can gloss over them without missing much. The main thing to note is that they don't use side-effects to build a parse tree. Combinators read an input buffer containing a flattened parse tree and build a flattened AST by appending to an output buffer which is then “lifted” into a proper tree in a reverse of the flatten function used above.

    Definitions of combinators

    Here is a grammar for a toy language that looks like TypeScript. This grammar has gaps (no switch); it is too long already.

    Unless otherwise stated, examples hereafter use this transform from parse trees to ASTs:

    The rest of the end-to-end tests are a bit duplicative.

    Let statement tests
    For loop tests
    Function declaration tests
    More expression tests
    More statement tests

    Quantifying Parser Resilience

    To answer “How good is this parser at containing damage?” we collect statistics on how many substrings can be removed from a benchmark code sample without affecting the parsing of structures on either side.

    outerBefore();
    {
      innerBefore();
      f(a/2*'3');
      innerAfter();
    }
    outerAfter();
    

    This benchmark code fragment was crafted to include a variety of token types, and so that some mutations would introduce new token types like comments and regular expression literals.

    For each possible substring of the underlined portion, we parse a program having removed that substring from the benchmark, and test several properties of the resulting AST:

    has outer beforehas inner before has inner afterhas outer afterhas error

    The parser reliably parses statements before the mutated statement and in a different block. The parser fairly reliably parses statements after the end of the block, and in the same block but before, but is not perfect; removing the substring “2” produces f(a/*'3'); and the block comment consumes everything after it. The parser has mixed success parsing statements that occur after the mutated statement in the same block.

    Code as data

    The parsing scheme outlined above provides for a limited kind of homoiconicity; tools less complex than the compiler can consume fragments of code using a small set of parsing rules that readily extend to new special forms.

    Homoiconicity also makes it easier for programs to operate on or produce program fragments. Quoting from “Quasiquotation in Lisp” (Bawden 1999):

    Quasiquotation is the technology commonly used in Lisp to write program-generating programs.

    The backquote character (`) introduces a quasiquotation … Inside the quasiquotation, the comma character (,) marks expressions whose values are to be substituted into the result.

    S-expressions were at the core of McCarthy's original version of Lisp. The ability to manipulate programs as data has always been an important part of what Lisp is all about. But without quasiquotation, actually working with S-expressions can be painful.

    So quasiquotation requires two affordances:

    JavaScript-esque template literals provide this, but run afoul of:

    The string substitution that underlies [fprintf] has no understanding of the syntactic structure of the programming language being generated.

    Javascript's tagged template literals allow that, but require implementing a parser for code with holes and lose source metadata. This section shows how this operator precedence scheme can meet the goals of quasi-quotation within a C-like language.

    … goals for a successful implementation of quasiquotation:

    
    
    

    Defining “\{” and “\(” as bracket operators, enables quoting statements and expressions. Defining “${” as an expression operator allows embedding unquoted expressions inside quoted code. Then we add some extra productions to the toy grammar, and augment a few existing ones.

    
    
    

    With that “\{}” embeds a parse sub-tree as data. qtree and other bluish AST nodes recreate the parse tree structure of the quoted portions.

    If qinner behaves like array and qleaf like literal then that quasiquotation would produce a tree like data structure. The qhole breaks back out into unquoted AST nodes that could contribute to the larger data structure.

    Since “\{}” uses the cover grammar, its consumers can recognize forms not allowed by the parse-tree → AST phase. In this sense, it is similar to Common Lisp's read which recognizes some characters (‘[’, ‘]’, ‘{’, ‘}’, ‘?’, and ‘!’) which are not used by Common Lisp syntax but which are explicitly reserved for user extensions.

    List structure is not quite as stark a representation [of code] as character strings, but it is still pretty low-level. Perhaps we would be happier if, instead of manipulating lists, our quasiquotation technology manipulated objects from a set of abstract data types that were designed specifically for each of the various different constructs in our language (variables, expressions, definitions, cond-clases, etc.). After abandoning character strings as too low-level it seems very natural to keep moving towards even highler-level representations.

    Separately, “\()” quotes an AST with holes. These two constructs let program generators deal with program fragment templates at two different levels of abstraction: parse trees and ASTs.

    Risk: Lock-in

    The second kind of quasiquotation exposes details of language specification abstracts to user code. Some language designers have expressed concern that this may make it harder to evolve the language. On a proposal for a binary encoding for EcmaScript AST:

    WH: compatibility means existing nodes strips compiled continue to work as we upgrade the language. But as ECMAScript evolves the same text source code may compile to different nodes even if they don't use new constructs. … The Hyrum's Law challenge is that folks may come to rely on source text compiling to a specific AST.

    The parse tree form has fewer rules, but there is still a non-zero lock-in risk:

    WH: In this committee it came up discussing changing associativity of || from left-to-right to right-to-left in order to properly support one of the variants of the ?? proposal. It's invisible from within ECMAScript but would change which AST gets generated. There was opposition related to the effect this would have on Babel's ASTs.

    In Lisp, the grammar for S-expressions is very simple and stable. Without experience maintaining a language whose initial parsing is based on cover grammars & precedence table, it's hard to say how stable these are over time, or what strategies are effective in working around compatibility problems due to Hyrum's law.

    Limitations & CounterExamples

    This parser scheme produces a parse tree that groups tokens for some C-like languages into a tree that is unambiguous and which can be processed into an AST.

    It deals poorly with some idioms and grammatical constructions.

    Angle bracket ambiguity

    As previously discussed, there is a deep, unavoidable ambiguity between ‘<’ as a comparison operator and as a bracket around type parameters.

    Unless there is a closed set of operators that precisely distinguish type contexts from exception context, this is incorrigible. template could be defined as a prefix operator, but operator precedence parsing cannot take into account facts like “vector is a type name” without CF-violating entanglements.

    Flow control without brackets

    This is subideal because the low precedence “--” operator captures the apparent function call if (cond) since it can function as both a prefix and postfix operator. This can be partially fixed by defining if and friends as low-priority prefix operators, as is already done for do.

    The if node is now self contained, but without some invisible adjacency operator auto-inserted between the close parenthesis and -- the parse tree is still odd. Worse, the -- acts as a postfix operator on if so this alternate nesting scheme would require extra rules to prevent postfix operators applying to if.

    Worse, this is no longer homoiconic. New operators that want to syntactically mimic the structure of if need to have their own entry in the precedence table.

    Else Else If

    The cover grammar allows else if to follow else which might introduce some shift-reduce style problems. This may be corrigible by tweaking canNest with respect to a prefix if and infix else or by recognizing else if as a separate, two token infix operator in preparseTokens as for Python-esque not in and is not.

    This is not a problem when the language requires brackets around if bodies as the toy language does.

    Flow control without parentheses

    Go makes parentheses optional around flow control constructs; it uses for init;cond;incr { } instead of for (init;cond;incr) { }.

    This is subideal because the low priority && operator captures the higher priority infix {} operator. Again, defining infix curly brackets as low-priority can probably address this but the author has not proven this. (TODO: try this. It shouldn't break do{}while)

    Scoping

    It would be nice if tools could make high-quality-ish conclusions about chunks of a program. One of the most important things is to match declarations with uses, which is often a prerequisite to identifying free variables.

    The author has been unable to find a generic algorithm that identifies declarations.

    It is generally the case that in keyword (...) {...} any declarations immediately inside the parentheses are in scope for the code in the curly brackets.

    This is not sufficient.

    1. To recognize a declaration in C, C++, and Java, you need to have a list of symbols that are type names. For example, in C++, {x * y;} is a declaration of a pointer when x is a type name, or an invocation of the infix ‘*’ operator when it is not.

      In Java, inner classes can be inherited, so local analysis is not sufficient to tell whether x.y.z is a reference to a static member of unqualified type x, a reference to a member of a (possibly inherited) field x, or a reference to a static member of a class with the absolute name x.y.

      Languages that consistently set declarations apart with a keyword like let, const, val, or var could define those as prefix operators and may avoid this problem.

    2. Code inside a class declaration in languages like Java and C++ inherit masking symbols, so you need to understand the inheritance structure to draw conslusions about variables in scope.

      Languages that require this.x to access fields, or that shorten this to .x by defining ‘.’ as a prefix operator can avoid this problem.

    Suggestions

    If you wish your language to be parseable by schemes like this:

    Summary

    Separating parsing of coarse-level structure from fine-grained structure provides benefits including: better tool support during development, easier partial program analysis, language extensibility, and metaprogramming.

    Some simple guidelines make it easier to craft a new language that gets these benefits and uses a syntax familiar to developers who work with widely-used, C-like languages.

    Ambiguity between less-than & angle-brackets means that not all existing languages cannot be migrated to this scheme, although, strangely, some complex languages like TypeScript may be migratable. There are clear precedents that show how to design around this ambiguity.

    That ambiguity avoided, operator precedence parsers can parse C-like languages. (Operator precedence parsers have not been widely used for parsing C-like languages, except in conjunction with RD, though Perl 6's Parser Grammar Engine may be one precedent.)

    This operator precedence parser degrades gracefully given partial and broken inputs. It combines a small fixed set of rules with an extensible operator partial-order, so also meets the extensibility goal.

    The same techniques that allow grammar extensibility, also allow extensible quasiquotation and use cases enabled by Common Lisp's read.

    Try it out

    Input
    Tokens
    Parse Tree
    Problems
    Flat Tree
    AST
    Parts of Speech

    ↑ Output, ↓ Input



    Thanks for reading!