This is a design document for Temper, a programming language for high-assurance libraries that run inside anything.

Pattern Decomposition

Pattern matching syntax lets a little code do a lot. Who has an interest in how matching works? Can we extend pattern matching beyond algebraic data types to niches like functional streaming APIs, comprehensions, parsing, and unparsing?

    Abstract

    “Pattern matching” provides succinct, readable syntax that balances two common concerns: the need to recognize the boilerplate that determines which strategies might succeed, and the need to extract the essential bits needed by subsequent steps.

    This document discusses the stakeholders in pattern matching, the shortcomings of matching that is limited to algebraic data types, and an experiment in decomposing patterns into optimizable queries.

    Stakeholders

    Authors

    Code authors have an interest in being able to succinctly decompose a problem into cases. That is too vague, so let's consider how authors adopt more specific roles when faced with particular problems.

    Routing Code Authors

    Authors of network services need to route incoming messages to code that handles them and separate salient bits from boilerplate that is only relevant for routing.

    People reviewing routing code want to ensure that untrusted inputs are converted to well-formed domain objects as soon as possible, and that unexpected and unvetted content does not reach deeper, less-suspicious layers of the application.

    People debugging routing code would like to know why an input didn't match any case. Pattern matching is often all or nothing, but a good pattern matcher might, at the cost of reapplying patterns in a less efficient mode, be able to report which pattern came closest to matching.

    Let's make up some syntax and consider how to balance these needs for some routing code.

    (x : inpmessage) =~ {
      // Here, x is the input message.
      // The `=~` indicates that the left is being
      // matched.
      // The right is what is matched against: a block of `case`
      // statements.  "Patterns" appear between `case` and `:`.
    
      // A pattern is an arbitrary parse tree that need
      // not pass the language's well-formedness checks.
    
      // The type of the left operand (inpmessage in this case)
      // is asked to decompose the parse tree into a pattern.
    
      case { type: 'foo', let y }:
      // If .x has a type property that is 'foo' or 'bar', and a
      // .y property, and no other properties, bind y to .y
      // and run the case code.
    
      // The presence of the token sequence `let y` authorizes the
      // decomposer to assign a value to `y` in the scope of the
      // statements after the `:`.
    
      case { type: 'bar', let y }:
      // When cases are not separated by a statement, they should
      // include the same `let`s with the same types.
      // Since a decomposer gets to specify the type of any symbols
      // introduced, it could use option types where some are not
      // present.
      // That kind of smarts requires that the decomposer be able
      // to correlate across cases.
    
        fooBarHandler(y);
        // Case bodies do not fall through without an explicit
        // `fallthrough`, so the value of the match as a whole is
        // the value of the last statement.
        // `;` is considered a statement for the purposes of
        // breaking series of adjacent cases.
    
      case { type: 'baz', let y, let z?, ... }:
      // Maybe the decomposer treats `...` as meaning don't exclude
      // other properties.
      // Maybe it treats the `z?` as meaning that z may be absent.
    
        bazHandler(y, z);
    
      default(i, e):
    
      // Without a `default`, `=~` operations fail when all pattern
      // matches fail.  If the `default` has formal parameters like the
      // `(i, e)` above, then i binds to the indices of the patterns
      // that came closest to matching and e binds to an opaque
      // diagnostic.
    
      // C allows a `default` case in a `switch` statement to appear
      // anywhere so that it can fall through to a non-default case.
      // Temper will require default to be at the end since fall-through
      // from default is not required for any use case AFAICT.
    
        // Send the indices of the most likely patterns to some telemetry
        // system that might escalate to a human if there's an unusual
        // pattern of failure.
        failureTelemetry(i);
    
        // During interactive debugging, maybe the diagnostic could include
        // the source code of any patterns that tied for closest to matching.
        log(e);
    
        // Respond with a message to the client that sent the malformed
        // message.
        badInputHandler(x);
    }
    // Execution proceeds as if each pattern matcher was tried
    // in series.  The first pattern match to succeed determines the
    // series of statements to run.  Execution of `=~` succeeds when
    // a pattern succeeded and its corresponding statements succeed.
    // Execution also succeeds if no pattern match succeeded and there's
    // a default case whose statements succeed.
    // Otherwise, `=~` fails.
    
    // Later we discuss how much latitude a generic query optimizer might
    // need to turn pattern matching into a more efficient operation based
    // on something like lookup tables and/or decision trees.
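
    To make this concrete, here is a rough Kotlin approximation of the routing match, assuming the input arrives as a Map<String, Any?>. fooBarHandler, bazHandler, and badInputHandler stand in for the hypothetical handlers above, and the telemetry and diagnostic plumbing is omitted:

    fun fooBarHandler(y: Any?) = println("foo/bar handler: $y")
    fun bazHandler(y: Any?, z: Any?) = println("baz handler: $y, $z")
    fun badInputHandler(x: Map<String, Any?>) = println("bad input: $x")

    fun route(x: Map<String, Any?>) {
        val type = x["type"]
        when {
            // case { type: 'foo', let y } and case { type: 'bar', let y }:
            // exactly the keys {type, y}, with type either 'foo' or 'bar'.
            (type == "foo" || type == "bar") && x.keys == setOf("type", "y") ->
                fooBarHandler(x["y"])
            // case { type: 'baz', let y, let z?, ... }:
            // y required, z optional, extra keys tolerated because of `...`.
            type == "baz" && "y" in x ->
                bazHandler(x["y"], x["z"])
            // default: nothing matched.
            else -> badInputHandler(x)
        }
    }

    fun main() {
        route(mapOf("type" to "foo", "y" to 1))                   // foo/bar handler: 1
        route(mapOf("type" to "baz", "y" to 2, "extra" to true))  // baz handler: 2, null
        route(mapOf("type" to "qux"))                             // bad input: ...
    }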
    

    Meta-programmers

    Meta-programming often involves a lot of case-based analysis: given a value from a tagged union, dispatch its parts to code that handles its variant. This shows up repeatedly: compilers are a series of passes, many of which receive trees and produce trees. Tree nodes are often defined using algebraic data types (ADTs), so the number of possible cases is fixed. Instructions in compiler intermediate forms similarly fall into a fixed set of instruction types.

    Whereas for message routing the router checks for the presence or absence of keys and for particular values in key-value bundles, in meta-programming code one generally has a value of a union type and wants to use RTTI to pick an appropriate handling strategy.

    Some bits in a value's state vector can be used to index into a jump table that leads to the appropriate handling code, while other parts of the state vector need to be bound to names used by that handling code.

    (x : option[t]) =~ {
      case Some(v): f(v);
      case None:    g();
    }
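
    For comparison, here is that dispatch in Kotlin, whose sealed classes and `when` are close to the ADT matching this section describes: the `is` check is the RTTI lookup, the smart cast does the binding, and the compiler checks exhaustiveness. Opt, Some, and None are local stand-ins for the option type above.

    sealed class Opt<out T>
    data class Some<out T>(val value: T) : Opt<T>()
    object None : Opt<Nothing>()

    fun <T> handle(x: Opt<T>, f: (T) -> Unit, g: () -> Unit) =
        when (x) {                 // exhaustive: no `else` needed
            is Some -> f(x.value)  // case Some(v): f(v) -- x smart-cast to Some<T>
            is None -> g()         // case None:    g()
        }

    fun main() {
        handle(Some(42), { v -> println(v) }) { println("none") }
    }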

    Parser Authors

    Parsers are needed whenever incoming messages aren't in a generic format like JSON or XML. There are many tools that help: parser generators that build parsers from declarative grammars, combinator libraries that allow specifying a parser in a declarative-ish subset of a language, etc. These are great for easily producing maintainable parsers for simple languages, but high-throughput systems almost never use them. Instead, at the heart of these systems is hand-crafted code that parses complex formats, because judicious use of lookup tables and context bits beats declarative DSLs.

    Given that authors are going to write recursive-descent-ish parsers in a mix of declarative-esque and hyper-optimized fast-path code, it'd be nice if large chunks of it used succinct yet expressive syntax like pattern matching to grab the substrings that are not boilerplate.

    The main tools in any parser author's toolkit are concatenation, alternation, repetition, references to other patterns, and matching over subregions of the input.

    When the content being processed is textual (as opposed to a token or event stream), character sets and lookahead, including negative lookahead, additionally matter.

    In addition, parsers are much easier to follow when matching a prefix of the unprocessed input results in some default behavior, like advancing a cursor to the end of the processed prefix. This means that a successful match at a position x in a stream should be able to update x to point to some later position in the processed input. There is no need to allow pointing in between input elements, such as at a byte that is not at the start of a UTF-8 multi-byte sequence.

    This need for a match to be able to update the matched value is why we use a Perl-esque “=~” operator instead of a C-like “switch”. We could define “=~” as match-and-assign while using infix “~” as match-without-assign.

    Reference and subregion imply the ability to deal with non-regular inputs, which can be handled by chaining together multiple separate match operations. A regular grammar is one in which every rule has exactly one non-terminal on the left and, on the right, at most one terminal and at most one non-terminal. Allowing references in the middle of patterns is equivalent to allowing more than one non-terminal on the right. Regular languages are optimizable in ways that non-regular ones are not, so the limited set of operators needed to support regular expressions is sufficient for this use case and probably a good set for a query optimizer.
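
    As a minimal illustration of chaining in Kotlin: balanced parentheses are a textbook non-regular language, yet a loop of trivially regular single-token matches, with a counter carried between matches, handles them.

    fun balanced(s: String): Boolean {
        var depth = 0
        for (ch in s) {            // each step is a trivially regular match
            when (ch) {
                '(' -> depth++
                ')' -> if (--depth < 0) return false  // closed more than opened
            }
        }
        return depth == 0
    }

    fun main() {
        println(balanced("(()())"))  // true
        println(balanced("(()"))     // false
    }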

    // Imagine cursor points into an input buffer.
    (cursor) =~ {
      case let id = +/[a-z]/ & */[a-z0-9]/:
        // If control reaches here, then id is a slice
        // containing a letter followed by zero or more alphanumerics.
        // `&` means concatenation.  Maybe `||` means alternation.
        // `+` and `*` could mean repetition.
        // /…/ bracket character sets in RegExp format.
        // !(…) might be negative lookahead.
      default(i, e):
        // A good metric for the pattern that almost matched is
        // longest prefix consumed before failing.
        // Icon-style pass/fail semantics handle cursor resets
        // on failing branches, so cursor has its
        // original value.
    }
    // If control reaches here, then cursor points at the end of the
    // processed chunk
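
    Here is a minimal Kotlin sketch of those cursor semantics: a successful match advances the cursor past the consumed prefix, while a failing match leaves it at the original position, Icon-style. The identifier pattern +/[a-z]/ & */[a-z0-9]/ is hand-compiled, and Cursor and matchIdentifier are invented names.

    class Cursor(val input: String, var pos: Int = 0)

    // Returns the matched slice and advances the cursor,
    // or returns null and leaves the cursor untouched.
    fun matchIdentifier(c: Cursor): String? {
        val start = c.pos
        // +/[a-z]/ : one or more lowercase letters.
        if (c.pos >= c.input.length || c.input[c.pos] !in 'a'..'z') return null
        while (c.pos < c.input.length && c.input[c.pos] in 'a'..'z') c.pos++
        // */[a-z0-9]/ : zero or more alphanumerics.
        while (c.pos < c.input.length &&
               (c.input[c.pos] in 'a'..'z' || c.input[c.pos] in '0'..'9')) c.pos++
        return c.input.substring(start, c.pos)
    }

    fun main() {
        val c = Cursor("abc123 rest")
        println(matchIdentifier(c))  // abc123
        println(c.pos)               // 6: cursor advanced past the prefix
        val d = Cursor("123")
        println(matchIdentifier(d))  // null
        println(d.pos)               // 0: the cursor keeps its original value
    }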
    

    Code Reviewers

    Pattern matching can fail because no pattern matched, i.e. there was a problem between a “case” and its “:”, or because the handler code failed, i.e. there was a problem after a “:”. Code reviewers do not want to spend time guessing or proving which case bodies can fail and which code paths are unreachable.

    If it's easy for an author to put compiler-checked cues in their code that answer these questions, reviewers can focus on reasoning about cases individually instead of reasoning about properties of collections of cases.

    This is orthogonal to pattern matching though. The same problems arise with regular flow control, like chains of “if”…“else if”. Still, let's bikeshed some syntax to address this:

    (x) =~ {
      case …:
        !!!{
          // Maybe prefix `!!!` has no effect at runtime but causes a
          // fatal error if there is a path that leaves in failure.
        }
      default:
        woeisme;
        // Maybe the woeisme keyword is a statement that should be
        // eliminated as dead code, and the compiler issues a fatal error if
        // it can't.
    }
    

    Neither !!! nor woeisme is specific to pattern matching, so { f(); woeisme; } might indicate that the author expects f() to never exit.
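
    Kotlin's Nothing type is a rough existing analogue of these cues: declaring that a call never returns normally lets the compiler treat subsequent statements as unreachable and waive the usual definite-return rules. It is only an analogue; woeisme as sketched above is a compile-time proof obligation rather than a runtime throw.

    // A Nothing-typed function never exits normally.
    fun f(): Nothing = throw IllegalStateException("f never returns")

    fun demo(): Int {
        f()
        // No `return` is needed: the end of the block is unreachable,
        // and any statement placed here is flagged as unreachable code.
    }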

    Type Designers

    Type designers benefit from the flexibility to design succinct, expressive syntaxes that mirror how programmers think about values in the type.

    Ideally, they wouldn't need to do that for all types. For example, many languages define tagged unions in terms of a series of distinctly named, declarative constructors. Rust and Go allow constructing via a type name followed by a parenthesized argument list, or a type name followed by a { propertyName: value, … } block. Many languages use […] syntax for both composing and decomposing random-access sequences.

    This doesn't satisfy the parsing use case though.

    Maybe a construct like

    (x) =~ {
      case some tokens:       f();
      case let z more tokens: g();
      default(i, err):        h();
    }
    

    where the type of x is T, desugars to

    let decompositions = [
      // Pass in bundles of tokens.
      \( some tokens ),
      \( let z more tokens )
    ].map(T/* including any type params */.decomposePattern);
    
    // The language builtins include a macro that builds a set of
    // branches that assign smallInt or the diagnostic fields for
    // the fallback case.
    module.builtins.queryOptimize(
      decompositions,
      \( x ),
      // Then it needs the bodies of the cases to dispatch to.
      [
        ([],               \( f() )),
        ([\( z )],         \( g() )),
      ],
      // And it needs the default case.
      Some ([\( i ), \( err )], \( h() )),
    );
    

    The query optimizer would produce something semantically equivalent to the below, but that hopefully avoids duplicate work.

    all_done: {
      try_default: {
        try_case_1: {
          {
            do {code_to_match_case0} // generated from `some tokens`
            ||                       // `||` traps failure and executes the RHS when the LHS fails
            do {break try_case_1}
          }
          f();
          break all_done;  // jumps out of the block, using the result of f() as the result
        }
        let z;  // Variables extracted from pattern
        {
          do {code_to_match_case1}  // generated from `let z more tokens`
          ||
          do {break try_default}
        }
        g();
        break all_done;
      }
      let i, err;  // from default parameter list
      do {code_to_figure_out_best_almost_match}
      h();
    }
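
    For readers who want something executable, the labeled-block control flow above can be approximated with Kotlin's labeled run blocks, where return@label plays the role of break label. matchCase0 and matchCase1 are hypothetical stand-ins for the generated matching code, reporting failure with null instead of Icon-style failure, and the best-almost-match bookkeeping is omitted.

    fun matchCase0(x: String): String? = x.takeIf { it.startsWith("foo") }
    fun matchCase1(x: String): String? = x.takeIf { it.startsWith("bar") }
    fun f(m: String) = "case 0 matched: $m"
    fun g(z: String) = "case 1 matched: $z"
    fun h(x: String) = "default: $x"

    fun dispatch(x: String): String = run all_done@{
        run try_default@{
            run try_case_1@{
                // `do {…} || do {break try_case_1}` becomes an elvis jump.
                val m0 = matchCase0(x) ?: return@try_case_1
                return@all_done f(m0)
            }
            val z = matchCase1(x) ?: return@try_default
            return@all_done g(z)
        }
        h(x)  // the default runs once every case has fallen through
    }

    fun main() {
        println(dispatch("foobar"))  // case 0 matched: foobar
        println(dispatch("barqux"))  // case 1 matched: barqux
        println(dispatch("other"))   // default: other
    }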
    

    This would allow a type designer to define a decomposePattern macro. In the absence of such an explicit declaration, the language could auto-generate one for certain kinds of type declarations.

    Type designers have an interest in keeping implementation details hidden, so any auto-generated decomposers should respect information hiding. Perhaps boxing the token bundles passed to .decomposePattern using the context of the current module's keys would suffice.

    Shortcomings of Algebraic Data Type Matching

    ML-family languages like OCaml and SML/NJ, and also Scala, have flexible, expressive pattern matching for algebraic data types (ADTs) that does not require supporting code by type authors. Given a type declaration:

    type (α, β) TypeName =  (* declares a type with two type parameters: α, β *)
        Constructor1 of α  (* the type is a tagged union *)
      | Constructor2 of β  (* each variant has its own state vector layout *)
    

    one can match values thus:

    match typeNameValue with
      | Constructor1(alphaPattern) -> …
      | Constructor2(betaPattern)  -> …
    

    How might one match a URL in this scheme? We might wade through standards documents and craft a union of record types.

    type Url =
        HierarchicalUrl of {
          scheme:    HierarchicalUrlScheme;
          authority: UrlAuthority;
          path:      UrlPath;
          query:     UrlQuery;
          fragment:  UrlFragment;
        }
      | OpaqueUrl of {
          scheme:    OpaqueUrlScheme;
          content:   OpaqueUrlContent;
        }
    (* URLs have a complex structure with opaque terminology.
       Most code that uses URLs only touches on bits of this. *)
    

    but that makes for awkward patterns and requires users to remember obscure standards-document terminology:

    (* match http://example.com/... and https://example.com/...
       then do something with the path *)
    match urlValue with
      | HierarchicalUrl { scheme = HTTP; authority = EXAMPLE_DOT_COM; path = p; query = _; fragment = _ }
      | HierarchicalUrl { scheme = HTTPS; authority = EXAMPLE_DOT_COM; path = p; query = _; fragment = _ } -> f(p)
      | _ -> …
    

    Alternatively, we could hide that detail with a simpler type declaration, and move details about “schemes” and “authorities” into helper functions.

    type Url = Url of string
    

    makes it easy to match exact values

    match urlValue with
      | "http://example.com/"
      | "https://example.com/" -> …
    

    but now we can't use pattern matching to extract, e.g. the (decoded) “path” part.
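
    A Kotlin sketch of that helper-function route shows what we are reduced to: an extractor that decomposes a URL string into named parts, wired up with ordinary conditionals rather than patterns. UrlParts and hierarchical are invented names; java.net.URI does the standards-document work.

    import java.net.URI

    // A decomposed view of a URL, independent of how Url values are stored.
    data class UrlParts(val scheme: String, val authority: String, val path: String)

    // Returns the parts of a hierarchical URL, or null for opaque or bad input.
    fun hierarchical(url: String): UrlParts? =
        runCatching { URI(url) }.getOrNull()
            ?.takeIf { it.scheme != null && it.authority != null }
            ?.let { UrlParts(it.scheme, it.authority, it.path ?: "") }

    fun handle(url: String) = when (val p = hierarchical(url)) {
        null -> println("no match: $url")
        else ->
            if (p.scheme in setOf("http", "https") && p.authority == "example.com")
                println("path: ${p.path}")
            else println("no match: $url")
    }

    fun main() {
        handle("https://example.com/a/b")     // path: /a/b
        handle("mailto:someone@example.com")  // no match
    }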

    What we can't do is use a familiar syntax for matching URLs which is independent of, and can evolve independently of, the structure of values' state vectors.

    In ADTs, structure dictates interface, so pattern matching that is limited to, and auto-derived for, ADTs has these limitations: patterns expose the structure of values' state vectors, so matchable types cannot hide their representations; match sites break when that structure evolves; and there is no way to offer a domain-familiar notation, like URL syntax, that is independent of the representation.

    What's in a match

    Here, we derive an experimental query optimizer and some pattern decomposers so we can play around with them in the browser.

    We need a way to parse case blocks, so we take the “toy grammar” (which is not a grammar for Temper; that is still TBD as of this writing) and adapt it to treat case patterns as opaque parse trees.

    Let's try to parse a pattern match.

    You can see, highlighted in yellow, a caseblock AST node. The way we specified the grammar lets us consider case blocks separately.

    Implementing query-optimizing code in JavaScript turned out to be hard, so I wrote the bulk of the experiment in Kotlin, which compiles to JS. This HTML page loads that implementation and uses it to generate JS code that does pattern matching.

    Plumbing to ease calling JavaScript generated from Kotlin code.

    Now, given a caseblock, we can parse it and dispatch it to a decomposer. You, dear reader, may safely skip over this glue code.

    Glue code that bridges case block ASTs with custom decomposers to produce Decompositions that feed into a generic compiler pass.

    Let's draft a simple decomposer for boolean values, to see how some of the pieces fit together. Later we'll consider more complicated cases to understand the other parts.

    So decomposing a pattern involves declaring some temporaries, and specifying requirements between them. Let's generate some code.

    The decomposer is careful to define the ranges so that they partition the space of all values comparable to true, which lets the QueryBuilder share a single comparison for both cases.
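
    In Kotlin terms, the trick looks something like this sketch: false and true are ordered, so the ranges [false, true) and [true, true] partition the booleans, and one ordered comparison distinguishes both cases instead of two equality tests.

    fun onBool(x: Boolean): String =
        if (x < true) "matched `case false`" else "matched `case true`"

    fun main() {
        println(onBool(false))  // matched `case false`
        println(onBool(true))   // matched `case true`
    }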

    Let's look at a more complicated example. Let's try some case blocks that compare against constant strings. (In production, we'd probably special-case exact string matches to use perfect hashes, but strings are complex enough to let us explore how decision trees can deal with structured values.)

    Here's what we get when we decompose “foo” into less-than relationships.

    There are several things to note here:

    And if we pad the strings so that length is not a differentiator, we get something a bit simpler.
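
    One plausible reading of those less-than relationships, sketched in Kotlin for the equal-length case: equality against a constant such as "foo" becomes the half-open range "foo" <= s < "foo\u0000", and the dispatch becomes a chain (or, with a smarter optimizer, a balanced tree) of ordered comparisons that adjacent cases can share. This is an illustrative guess at the shape, not the experiment's actual output.

    // Returns the index of the matched case, or -1 for the default.
    fun whichCase(s: String): Int = when {
        s < "bar" -> -1           // below every constant
        s < "bar\u0000" -> 0      // exactly "bar": "bar" <= s < "bar\u0000"
        s < "foo" -> -1           // strictly between the constants
        s < "foo\u0000" -> 1      // exactly "foo"
        else -> -1                // above every constant
    }

    fun main() {
        check(whichCase("bar") == 0)
        check(whichCase("foo") == 1)
        check(whichCase("baz") == -1 && whichCase("fooz") == -1)
        println("ok")
    }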

    TODO: more complex values including RTTI requirements.

    TODO: loops oh my!

    TODO: identifying dead cases and coverage via regular language intersection.

    Thanks for reading!