RFC 145, by Eric J. Roode: Brace-matching for Perl Regular Expressions

Problem and proposal

The RFC 145 calls for a new regex mechanism to assist in matching paired characters like parentheses, ensuring that they are balanced. There are many “paired characters” in more or less daily use: (), [], {}, <>, «», "", '', depending on your local even »«, or in the fancy world of Unicode additionally ⟦⟧ and many, many more. In this article I will take up the RFC’s title and call all of them “braces”.

For example, consider the string ([b - (a + 1)] * 7). We might wish to extract all the subformulas

  • [b - (a + 1)] * 7,
  • b - (a + 1),
  • a + 1

from it, all of which are surrounded by a matching pair of braces using a global match. The reader is invited to try to write such a regex now.

The RFC author Eric Roode notes that this was still quite difficult in Perl in the year 2000. The task splits into two parts:

  1. Determining for an opening bracket what is its closing counterpart.
  2. Keeping track of the nesting levels and matching braces at each level.

The first subtask becomes hairy in a regex when there are multiple options for the opening bracket. The second subtask is hard for a more profound reason which goes by the name of “Dyck language“. The Dyck language is the set of all strings of properly paired parentheses (with hypothetical contents between them erased). It is the prototypical example of a language in the computer-science sense which is not regular but still context-free, meaning that it somehow needs a stack to keep track of nesting levels. Of course, regexes are more powerful than computer-scientific regular expressions but this fact may still justify why this is a difficult thing to do. Eric Roode recognized the gap between how easy this very common task in parsing structured data should be and how easy it is and wrote an RFC.

He proposed a pragma use matchpairs to solve subtask № 1 by providing a map from opening to closing braces. Pragmas are activated in a lexical scope and influence all regex matches in it. For subtask № 2 two new regex metacharacters were proposed, \m and \M for matching and remembering corresponding braces. Using these hooks, the nesting level business is offloaded onto the regex engine.

Spec and solution

RFC 145 is marked “developing”, meaning that it was not fully addressed in the Perl 6, and now Raku, specification. (Apocalypse 5 on pattern matching includes a response to RFC 145.) But there have been related improvements which I am going to use in this section to show how the problem posed in the beginning might be handled in Raku today.

The idea of using a pragma to set up a table of valid braces and then using “brace” regex metacharacters was not implemented, but the regex language was to be redesigned anyway and the designers extrapolated from brace matching and created a new regex operator for nesting structures, the tilde. This operator is used like this:

anon regex { '(' ~ ')' <body> }

and it achieves two things: it transposes body and closing brace so that the two delimiters are close to each other, even when <body> is long, and it sets up error reporting for when the closing brace was not found.

We can use this new feature to slightly improve the regex structure and get error reporting for free, but it does not keep track of nesting levels of the parentheses and it does not compute the closing brace for us if there had been multiple options for the opening one.

To compute the closing brace, it would suffice to have a way to capture the opening brace and pass it to a function whose return value is dynamically interpolated into the regex. This is now easy in Raku regexes and grammars:

grammar Formula {
    # Registry of understood braces.
    constant %braces =
        '(' => ')',
        '[' => ']',
        '{' => '}',
    ;

    # A parametric token which matches the closing brace
    # corresponding to its argument.
    token closing ($opening) {
        "%braces{$opening}"
    }

    rule braced {
        $<opening>=@(%braces.keys) ~ <closing($<opening>)>
          [ <expr> {} ]
    }

    rule expr {
        [ <:Letter>+ || <:Number>+ || <braced> ]+ % <[+*/-]>
    }
}

The crucial part is rule braced.¹ We capture the opening brace and then later ask for its corresponding closing brace from a lookup in the %braces map.² The @(%braces.keys) interpolation of a list invokes longest-token matching, so it will DWIM when multiple braces with overlapping prefixes are present.

Notice that the mutually recursive use of the <expr> and <braced> rules ensures correct nesting of braces without needing a dedicated gear for this in the regex engine. It falls out of Raku’s improved regex structuring and reusing facilities. It is time for a test:

grammar Formula { … }
sub braced-subexprs ($expr) { … }

braced-subexprs Q|([b - (a + 1)] * 7)|;
-- ([b - (a + 1)] * 7) ---------------------------------------------------------
Braces: ( * ) ||| Subexpr: a + 1
Braces: [ * ] ||| Subexpr: b - (a + 1)
Braces: ( * ) ||| Subexpr: [b - (a + 1)] * 7

Summary

In summary, brace matching is obviously useful in parsing structured data. It was proposed by Eric Roode to make this simple in Perl 6 / Raku. Although the feature was not implemented in the proposed form, the task has indeed become easier to accomplish and the code much easier to read, notably due to the new regex syntax and grammar support.

Encore!

If, like me, you are slightly bothered by the static brace table but are fine with heuristics, then the Unicode Consortium may be an unexpected ally. The Unicode Bidi_Mirroring_Glyph property gives hints about bidirectional writing, that is putting text on the screen when multiple scripts are involved, some of which write left-to-right and others right-to-left. Raku has built-in support for Unicode properties and we can use this one to let the Unicode Consortium pick closing braces for us:

    sub unicode-mirror ($_) {
        join '', .comb.reverse.map: {
            .uniprop('Bidi_Mirroring_Glyph')
                or .self
        }
    }

    token closing ($opening) {
        "{ unicode-mirror($opening) }"
    }

    regex braced {
        :sigspace
        $<opening>=<:Symbol + :Punctuation>+ ~ <closing($<opening>)>
        [ <expr> {} ]
    }

The &unicode-mirror heuristic splits the argument into characters, reverses their order and then either picks its mirroring glyph, if one is defined, or leaves the character as-is, then reassembles them into a string. This function successfully turns <{ into }>, for example.

braced was tweaked in two regards: it accepts any sequence of symbols and punctuation as opening braces now and it has been turned into a regex for full backtracking power when it is too greedy in consuming opening braces.

With these tweaks, we can go nuts and have the grammar do free association and match everything that “looks like a brace pair”:

-- ([b - (a + 1)] * 7) ---------------------------------------------------------
Braces: ( * ) ||| Subexpr: a + 1
Braces: [ * ] ||| Subexpr: b - (a + 1)
Braces: ( * ) ||| Subexpr: [b - (a + 1)] * 7

-- (=^123^=) -------------------------------------------------------------------
Braces: (=^ * ^=) ||| Subexpr: 123

-- <<<123>> --------------------------------------------------------------------
FAILED

-- >123< -----------------------------------------------------------------------
Braces: > * < ||| Subexpr: 123

-- >123> -----------------------------------------------------------------------
FAILED

-- <{ (a + <b>) / !c! / e * »~d~« }> -------------------------------------------
Braces: < * > ||| Subexpr: b
Braces: ( * ) ||| Subexpr: a + <b>
Braces: ! * ! ||| Subexpr: c
Braces: »~ * ~« ||| Subexpr: d
Braces: <{ * }> ||| Subexpr: (a + <b>) / !c! / e * »~d~«

Footnotes

The function used to report braced subexpressions is this:

sub braced-subexprs ($expr) {
    # Get all submatches of the C<braced> subrule.
    class BracedCollector {
        has @.braced-subexprs;

        method braced ($/) {
            push @!braced-subexprs, $/
        }

        method braced-subexprs {
            @!braced-subexprs.unique(as => *.pos)
        }
    }

    say "-- $expr ", '-' x (76 - $expr.chars);

    my BracedCollector $collect .= new;
    say "FAILED" and return
        unless Formula.parse($expr, :rule<expr>, :actions($collect));

    for $collect.braced-subexprs -> $/ {
        say "Braces: $<opening> * $<closing> ||| Subexpr: $<expr>";
    }
}

¹ In case you are wondering about the use of an empty block in [ <expr> {} ], this is due to an implementation detail in Rakudo’s regex engine which does not make the capture $<opening> available to a later subrule closing unless it is forced to. The empty block is one way to force it; cf. RT#111518 and DOC#3478.

² The essential feature of interpolating back the return value of a function call closing $<op> which may depend on previous captures was added, to the best of my knowledge, also around the year 2000 (so about the time this RFC was posted), to Perl 5.6, in this case with the spelling (??{closing $+{op}}).

3 thoughts on “RFC 145, by Eric J. Roode: Brace-matching for Perl Regular Expressions

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: