RFC 112 by Richard Proctor: Assignment within a regex

Richard wanted to

Provide a simple way of naming and picking out information from a regex without having to count the brackets.

I can say without hesitation that Raku (and before its rename, Perl 6) has achieved this goal — but all the details are different than proposed.

The reason is two-fold.

For one, Richard assumed a pretty straight-forward extension to Perl 5’s regex syntax, and his proposed syntax, (?$hours=..) for a named capture made sense. Instead, Raku regex syntax is pretty much a new thing, where all non-alphanumeric characters are potentially meta characters, and thus either used or reserved for one purpose or another. This made easier syntax than (?$name=regex) available for named captures.

The second is even more profound: The Raku designers realized that regexes could only be truly powerful if reuse was built in from the ground up. And the best way to make that happen was to make them first class.

I want to dwell on that point a bit: consider the power of functions (and closures) as first-class citizens in modern programming languages. Lisp has shown us what you can do with them, and now basically every programming language has got them. Dynamic languages like Perl, Ruby, Javascript and Python were pretty early adopters, modern statically typed languages like C# and F# also got them; even Java caught up eventually. Java didn’t even have functions, just methods, and now it’s got closures that you can pass around.

In my humble opinion, raising regexes to the level of first-class citizens and introducing a concise call syntax gave regexes a similar boost.

In the old days, it was common wisdom that you cannot parse XML (or other arbitrarily nested languages) with regexes, because they are not a regular language in the computer science sense. Perl 5 has some workarounds for that, but they are so clunky and verbose that I haven’t even seen them recommended much, and my general impression is that if you use them, it’s just for the lack of good alternatives.

Not so in Raku: <subrule> in a regex calls another regex called subrule, and so you have recursion (and, relevant to the discussion of RFC 112, named captures). This recursion moves regexes from regular into context-free language territory in the Chomsky Hierarchy. But more than recursion, the named regexes allow much easier reuse, testing in isolation and all that other wonderful stuff that first-classiness gave to functions. It also moved the sentiment towards parsing XML and other languages with regexes from “are you serious?” to “sure, it’s the best tool”.

The call syntax <subrule> implies a named capture, and it turns out that’s convenient enough that explicit named captures (not tied to a call) are actually pretty rare in real-world parsers. An explicit syntax for that exists though, it’s $<capturename>=[...].

Which brings us to the second interesting bit: RFC 112 doesn’t just talk about named captures, but implies that they are directly stored into variables of the same name.

This is problematic for a variety of reasons:

  • Scoping. A regex can be declared and used at two very different parts of the program. Forcing the variable to be in scope would make the capture syntax a source variables with an unnecessary large scope (dare I say global?), which is a clear anti pattern.
  • Quantifiers. In RFC 211 syntax, what would have happened with a regex like (?$char=.)+ matching the string abc? What’s in $char? The sigil implies a scalar, so… maybe the last capture, c? And throw away all the other matches? Doesn’t sound too good. Or maybe (?@char=.) would have an array with all captures, but then, when writing a regex, you’d have to know if anybody later wants to use that inside a quantifier. Not stellar either.
  • Composition. Binding matches to a variable assumes the regex is used as a top-level construct, and not part of larger thing.
  • Recursion. Do I even need to elaborate? Probably not.

The solution that’s implemented in Raku now is more suited to world of first-class regexes: for each regex match there’s a Match object. The top-level match object is stored in the variable $/, so accessing a named capture key is $/<key>, and there’s even a short-hand for that, $<key>. Just two character longer than the originally proposed $key.

This solution shines though when used in the context of composition, for example. Since the named capture corresponds to a regex match, it’s also a Match object, and so we arrive at a tree of matches (all alike). Or rephrased: a regex match already is a syntax tree.


I think this RFC is a good example how a real pain point and problem was identified, and a solution proposed. Aspects of this solution have survived the language design process, but most details haven’t, because the language changed much more than it barely being Perl 5 plus a few extensions through RFCs.

I have been participating in the Perl 6 project since around the year 2007, and have watched some of these transformations; for regexes, the majority of the consolidation and redesign work had already been done. I watched the implementations become more powerful, and even helped a little here and there.

Living in this process was a magical experience, just as magical as the result is now.

2 thoughts on “RFC 112 by Richard Proctor: Assignment within a regex

  1. Nice article, but I think it would be a great addition if the author could include a codeblock or graphic at the top of the article, showing the current “Raku 2020” implementation of Named Captures.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: