Day 5 – The Elves go back to Grammar School

It was Christmas Day in the workhouse

the snow was raining fast

a barefooted man with clogs on

came slowly running past

anon

’Twas the month before Christmas when a barefooted Elf with clogs on realised that they needed to load around 1_000_000_000 addresses into Santa’s Google Maps importer so that he could optimise his delivery route and get all the presents out in one night.

On the plus side, all the addresses came in via emails and PDF attachments as computer-readable Unicode text. But how to read all those addresses into a consistent format so that they could be fed to the maps app?

Then they remembered that Mrs. Claus had sent them all to (Raku) Grammar school and that now was the time to brush up on their parsing skills.

Peeking at the Answer

So, to work backwards, as is most comfortable for Elven folk, we start with some address text and feed it to a Raku Grammar to be parsed.

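#!/usr/bin/env raku
use v6.d;
use lib '../lib';
use Data::Dump::Tree;     # provides ddt
use Grammar::Tracer;      # traces each token as the grammar runs

my ($address, $match);

$address = q:to/END/;
123, Main St.,
Springfield,
IL 62704
USA
END

$address ~~ s:g/','$$//;        # drop commas at the end of a line
$address ~~ s:g/<['\-%]>//;     # drop apostrophes, hyphens and percent signs
$address .= chomp;              # drop the trailing newline

$match = AddressUSA::Grammar.parse($address, :actions(AddressUSA::Actions));

say ~$match;        # the matched text
say $match;         # the match tree
ddt $match.made;    # the object built by the actions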

Note the use of Grammar::Tracer: this is an essential design and debug tool for working out what is happening in the parsing process with various text inputs.

There’s a little preparation of our text to prune out unwanted punctuation and line endings using the ~~ s:g/// substitution operator.
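As a minimal sketch of that clean-up step on its own, here are the same substitutions applied to a made-up address (the sample text and the $text variable are assumptions for illustration, not part of the Elves' data):

```raku
# Hypothetical sample text, chosen to exercise each substitution
my $text = "12, O'Hare Rd.,\nSpring-field,\nIL 62704\n";

# Drop a comma that sits right at the end of any line
$text ~~ s:g/ ',' $$ //;

# Drop apostrophes, hyphens and percent signs wherever they occur
$text ~~ s:g/ <['\-%]> //;

# Remove the single trailing newline
$text .= chomp;

say $text.raku;
```

The comma after the house number survives because it is not at the end of a line; only the line-ending commas are pruned.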

The result of the parsing operation is a new AddressUSA object, so the output with ddt is:

.AddressUSA @0
├ $.street = 123, Main St..Str
├ $.city = Springfield.Str
├ $.state = IL.Str
├ $.zipcode = 62704.Str
└ $.country = USA.Str

All neatly tagged and ready for import to a structured system.

Grammar, Rhyme and Meter

The Grammar itself looks like this:

grammar AddressUSA::Grammar does Address::Grammar::Base {
    token TOP {
        <street>      \v
        <city>        \v
        <state-zip>   \v?
        [ <country>   \v? ]?
    }
    token state-zip {
        ^^ <state> <.ws>? <zipcode> $$    #<.ws> is [\h* | \v]
    }
    token state {
        \w \w
    }
    token zipcode {
        \d ** 5
    }
}

Sooo – as with any well-written Elven code, it is easy for the casual reader to grasp what is going on (with a little prior know-how of Raku Grammar basics, such as token TOP).

Casting a more expert eye, Rudolph (for it was he that crafted this tutorial) has a few expert remarks:

  • note the use of \v and \h to denote vertical and horizontal whitespace
  • since Raku does not dictate indentation, we can go with an aligned layout to better echo our input text
  • the query ? and square brackets [] help us with options (e.g. there may be no country line in the text)
  • token state looks for two word characters
  • token zipcode looks for 5 digits
  • there is a bit of magic in <state-zip> that picks out <state> and <zipcode> regardless of whether they are on the same line or on two lines
  • oh, and ^^ is the start of a line and $$ is the end (i.e. just before a \v)
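To see the <state-zip> "magic" in isolation, here is a small sketch. The grammar name StateZip and the two sample strings are assumptions for illustration. The trick appears to rest on the default ws token, which matches any run of whitespace, newlines included.

```raku
# Hypothetical cut-down grammar: only the state and zip part
grammar StateZip {
    token TOP       { <state-zip> \v? }
    token state-zip { ^^ <state> <.ws>? <zipcode> $$ }
    token state     { \w \w }
    token zipcode   { \d ** 5 }
}

# Same data, once on a single line and once split over two lines
for "IL 62704", "IL\n62704" -> $text {
    my $m = StateZip.parse($text);
    say $m
        ?? "state={$m<state-zip><state>} zip={$m<state-zip><zipcode>}"
        !! "no match";
}
```

Both inputs should yield the same state and zipcode, since <.ws> is happy to swallow either the space or the newline between them.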

So far, so ginchy.

Grammar Fundaments

Well, Rudolph was holding a bit up his sleeve (ed?) here, since Raku doesn’t come with tokens like <street>, <city>, etc. out of the box. An attentive Elf would spot the does Address::Grammar::Base … let’s look at that now:

my @street-types = <Street St Avenue Ave Av Road Rd Lane Ln Boulevard Blvd>;

role Address::Grammar::Base {
    token street {
        ^^ [<number> ','? <.ws>]? <plain-words> <.ws> <street-type> '.'? $$
    }
    token number {
        \d*
    }
    token plain-words {
        <plain-word>+ % \h
    }
    token plain-word {
        \w+ <?{ "$/" !~~ /@street-types/ }>
    }
    token street-type {
        @street-types
    }
    token town    { <whole-line> }
    token city    { <whole-line> }
    token county  { <whole-line> }
    token country { <whole-line> }
    token whole-line {
        ^^ \V* $$
    }
}

So that explains where the Base tokens are built up and underlines the fact that a Raku grammar is just a fancy class and that tokens are just fancy methods. That way you can use the same role composition via the keyword does.
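A minimal sketch of that role composition, with made-up names (LineTokens, CityCountry and CountryOnly are assumptions, not part of Rudolph's code): two different grammars pick up the same tokens from one role via does.

```raku
# Hypothetical role holding tokens shared by several grammars
role LineTokens {
    token whole-line { ^^ \V* $$ }
    token city       { <whole-line> }
    token country    { <whole-line> }
}

# Two grammars composing the same role
grammar CityCountry does LineTokens {
    token TOP { <city> \v <country> }
}

grammar CountryOnly does LineTokens {
    token TOP { <country> }
}

say ~CityCountry.parse("Lyon\nFrance")<city>;
say ~CountryOnly.parse("France")<country>;
```

Neither grammar defines city or country itself; both tokens arrive through the role, just as a class would receive methods.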

So key tools used here:

  • You can use any Raku Array like @street-types directly in a token; the elements are treated as a set of alternative Str values, as if you had written [ 'Street' | 'St' | … ]
  • The code assertion <?{ … }> is a cool zero-width way to call any code that returns True or False – in this case to check that the (stringified) match "$/" is not a street-type
  • And the “repeats” symbol % modifies the quantifier to add a separator pattern – in this case \h for a horizontal whitespace character
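The three tools above can be sketched together in a toy grammar. Everything here (the Distance grammar, the @units list, the token names and the 1000 limit) is a hypothetical example, not taken from the address code.

```raku
my @units = <km mi m>;    # hypothetical list of alternatives

grammar Distance {
    # one or more distances, separated by a comma and optional spaces
    token TOP      { <distance>+ % [ ',' \h* ] }
    token distance { <number> \h? <measure> }
    # code assertion: only accept numbers below 1000
    token number   { \d+ <?{ 1000 > "$/".Int }> }
    # array in a token: behaves like 'km' | 'mi' | 'm'
    token measure  { @units }
}

my $m = Distance.parse("5 km, 12mi");
say $m<distance>.map(-> $d { "$d<number> $d<measure>" }).join("; ");

# The assertion rejects this input, so the parse fails
say Distance.parse("1000 km") ?? "parsed" !! "rejected";
```

Because tokens do not backtrack, a number that fails the code assertion fails the whole parse rather than being retried with fewer digits.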

Actions of Compassion

“In the quaint streets of Victorian London, a benevolent soul anonymously gifted warm blankets to shivering orphans, embodying the true spirit of Christmas compassion in Dickensian fashion.”

ChatGPT 3.5

That just leaves the grammar Actions to be added in. In our example below, first we define our class AddressUSA to hold the results, and then we make a new instance of it with the action method TOP.

class AddressUSA {
    has Str $.street;
    has Str $.city;
    has Str $.state;
    has Str $.zipcode;
    has Str $.country = 'USA';
}

class AddressUSA::Actions {
    method TOP($/) {
        my %a;
        %a<street>  = $_ with $<street>.made;
        %a<city>    = $_ with $<city>.made;
        %a<state>   = $_ with $<state-zip><state>.made;
        %a<zipcode> = $_ with $<state-zip><zipcode>.made;
        %a<country> = $_ with $<country>.made;
        make AddressUSA.new: |%a
    }
    method street($/)  { make ~$/ }
    method city($/)    { make ~$/ }
    method state($/)   { make ~$/ }
    method zipcode($/) { make ~$/ }
    method country($/) { make ~$/ }
}

The action methods use make and made to percolate the parts of the Grammar that we want up to the method TOP, and then into the new object’s attributes with AddressUSA.new: |%a

  • This shows that the associative accessors we normally see with hashes, like %a<street>, are very similar to accessors on the match object itself, like $<street>, because (err) they are the same: $<> is just syntactic sugar for $/<>
  • You can use them like %a{'street'} if you prefer more typing and more line noise
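Here is a minimal sketch of make and made on a toy grammar, also showing $<…> and $/<…> side by side. The names Tally, Tally::Actions, label and amount are assumptions for illustration only.

```raku
# Hypothetical toy grammar: "label=amount"
grammar Tally {
    token TOP    { <label> '=' <amount> }
    token label  { \w+ }
    token amount { \d+ }
}

class Tally::Actions {
    method TOP($/) {
        # $<label> and $/<amount> use the same accessor on the match object
        make %( label => $<label>.made, amount => $/<amount>.made );
    }
    method label($/)  { make ~$/ }    # keep the text as a Str
    method amount($/) { make +$/ }    # convert the text to a number
}

my %result = Tally.parse('elves=9', :actions(Tally::Actions)).made;
say %result<label>;
say %result<amount> + 1;
```

Each lower-level method attaches a value with make, and TOP collects those values with made, which is the same percolation that builds the AddressUSA object.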

And so, my fine friends, to another year of -Ofun

There is nothing in the world so irresistibly contagious as laughter and good humor.

Charles Dickens, A Christmas Carol

I wish each and every one of you a Merry Christmas and a Happy New Year.

~librasteve

PS. All the code for this story is available as a single file in this… GitHub Gist
