Richard wanted to…
Provide a simple way of naming and picking out information from a regex without having to count the brackets.
I can say without hesitation that Raku (and before its rename, Perl 6) has achieved this goal — but all the details are different than proposed.
The reason is two-fold.
For one, Richard assumed a pretty straight-forward extension to Perl 5’s regex syntax, and his proposed syntax,
(?$hours=..) for a named capture made sense. Instead, Raku regex syntax is pretty much a new thing, where all non-alphanumeric characters are potentially meta characters, and thus either used or reserved for one purpose or another. This made easier syntax than
(?$name=regex) available for named captures.
The second is even more profound: The Raku designers realized that regexes could only be truly powerful if reuse was built in from the ground up. And the best way to make that happen was to make them first class.
In my humble opinion, raising regexes to the level of first-class citizens and introducing a concise call syntax gave regexes a similar boost.
In the old days, it was common wisdom that you cannot parse XML (or other arbitrarily nested languages) with regexes, because they are not a regular language in the computer science sense. Perl 5 has some workarounds for that, but they are so clunky and verbose that I haven’t even seen them recommended much, and my general impression is that if you use them, it’s just for the lack of good alternatives.
Not so in Raku:
<subrule> in a regex calls another regex called
subrule, and so you have recursion (and, relevant to the discussion of RFC 112, named captures). This recursion moves regexes from regular into context-free language territory in the Chomsky Hierarchy. But more than recursion, the named regexes allow much easier reuse, testing in isolation and all that other wonderful stuff that first-classiness gave to functions. It also moved the sentiment towards parsing XML and other languages with regexes from “are you serious?” to “sure, it’s the best tool”.
The call syntax
<subrule> implies a named capture, and it turns out that’s convenient enough that explicit named captures (not tied to a call) are actually pretty rare in real-world parsers. An explicit syntax for that exists though, it’s
Which brings us to the second interesting bit: RFC 112 doesn’t just talk about named captures, but implies that they are directly stored into variables of the same name.
This is problematic for a variety of reasons:
- Scoping. A regex can be declared and used at two very different parts of the program. Forcing the variable to be in scope would make the capture syntax a source variables with an unnecessary large scope (dare I say global?), which is a clear anti pattern.
- Quantifiers. In RFC 211 syntax, what would have happened with a regex like
(?$char=.)+matching the string
abc? What’s in
$char? The sigil implies a scalar, so… maybe the last capture,
c? And throw away all the other matches? Doesn’t sound too good. Or maybe
(?@char=.)would have an array with all captures, but then, when writing a regex, you’d have to know if anybody later wants to use that inside a quantifier. Not stellar either.
- Composition. Binding matches to a variable assumes the regex is used as a top-level construct, and not part of larger thing.
- Recursion. Do I even need to elaborate? Probably not.
The solution that’s implemented in Raku now is more suited to world of first-class regexes: for each regex match there’s a Match object. The top-level match object is stored in the variable
$/, so accessing a named capture
$/<key>, and there’s even a short-hand for that,
$<key>. Just two character longer than the originally proposed
This solution shines though when used in the context of composition, for example. Since the named capture corresponds to a regex match, it’s also a
Match object, and so we arrive at a tree of matches (all alike). Or rephrased: a regex match already is a syntax tree.
I think this RFC is a good example how a real pain point and problem was identified, and a solution proposed. Aspects of this solution have survived the language design process, but most details haven’t, because the language changed much more than it barely being Perl 5 plus a few extensions through RFCs.
I have been participating in the Perl 6 project since around the year 2007, and have watched some of these transformations; for regexes, the majority of the consolidation and redesign work had already been done. I watched the implementations become more powerful, and even helped a little here and there.
Living in this process was a magical experience, just as magical as the result is now.
2 thoughts on “RFC 112 by Richard Proctor: Assignment within a regex”
Nice article, but I think it would be a great addition if the author could include a codeblock or graphic at the top of the article, showing the current “Raku 2020” implementation of Named Captures.