Day 15 – An expression language for Vixen

#raku-beginners: korvo: Hi! I’m trying out Raku instead of META II for a toy compiler.

(5 minutes later) korvo: I’m merely trying to compile a little expression language because my angle of repose has slightly increased, and multiple folks have recommended Raku for parsing and lightweight compilers.

(20 minutes later) korvo: [T]hanks for a worthy successor to META II. This is the first enjoyable parser toolkit I’ve used in a while; I spent almost no time fussing over the Regex tools, found it easy to refactor productions, and am spending most of my time trying to handle strings and lists and I/O.

Happy Holiday! As we enter the winter-solstice season, it’s worth reflecting on the way that hunkering down for the cold can bring about new avenues of exploration. I have a jar of pickled homegrown banana peppers in the corner of my refrigerator slowly evolving into something delicious; I do have to shake it every few days, but it will be worth it to add that fruity sour spice to any dish. Similarly, this can be a season for letting old concepts ferment into new concepts.

I have written so many parsers and parser toolkits. I’m tired of writing parsers yet again. So, after hearing about Raku’s builtin support for grammars, I decided that I would commit to trying Raku for my next parsing task. How hard could it be? I’ve already used and forgotten Perl 5 and Ruby.

I don’t need to reinvent Raku. ~ me, very tired

Problem

I’m working on a language that I call Vixen. I should back up.

At the beginning of the year, Joel Jakubovic blogged that The Unix Binary wants to be a Smalltalk Method, Not an Object. They argue that, while we have traditionally thought of Unix files as corresponding to objects, we should instead identify Unix directories with objects and Unix files with methods. By “object” I mean a bundle of state and behavior which communicates with other objects by passing messages to them. This is a big deal for folks who study what “object” means, but not really interesting for the wider programming world. However, they followed it up with a prototype and a paper, The Unix Executable as a Smalltalk Method: And its implications for Unix-Smalltalk unification. Jakubovic provides a calling convention, which we call Smalltix, that allows us to treat a Unix system as if it were a Smalltalk-like message-passing object-based system. Crucially, there isn’t a single language for programming Smalltix, because of fragmentation: a Unix system already has many different languages for writing executable programs, and adding another language would only fragment the system further.

Okay! So, I’m working on Vixen, a fork of Smalltix. Jakubovic used Bash and Smalltalk-style classes; I’m simplifying by using execline and Self-style prototypes. Eventually, I ended up with a few dozen little scripts written in execline. Can I simplify further?

Now, I fully admit that execline is something of an alien language, and I should explain at least some of it before continuing. Execline is based on the idea of Bernstein chain loading; the interpreter takes all arguments in argv as a program and calls into multiple helpers which incrementally rewrite argv into a final command. Here’s an example method that I call “debug:” which takes a single argument and prints it to stderr. First it uses the fdmove helper to copy file descriptor 2 to file descriptor 1, shadowing stdout with stderr; finally, it echoes a string that interpolates the first and second items of argv. The calling convention in Smalltix and Vixen is that argv’s zeroth item is the method, the first item is the receiving object passed as a path to a directory, and the rest of the items are positional arguments. By tradition, there is one colon “:” in the name of a method per argument, so “debug:” takes one argument; also by tradition, the names of methods are called verbs. Since this method takes one positional argument, we pass the -s2 flag to the execline interpreter execlineb to collect argv up to index 2.

#!/usr/bin/env -S execlineb -s2
fdmove -c 1 2
echo "debug: ${1}: ${2}"

For something more complicated, here is a method named “allocateNamed:” which augments some other “allocate” method with the ability to control the name of a directory. This lets us attach names to otherwise-anonymous objects. Here, we import the name “V” from the environment envp to turn it into a usable variable. In Vixen, I’ve reserved the name “V” to refer to a utility object that can perform calls and other essential tasks. The backtick helper wraps a subprogram in a curly-brace-delimited block and captures its output. The foreground helper runs two subprograms in sequence; there’s also an if helper which exits early if the first subprogram fails.

#!/usr/bin/env -S execlineb -s2
importas -iS V
backtick -E path { ${V}/call: $1 allocate }
foreground { mkdir ${path}/${2} }
echo ${path}/${2}

Now, as the reader may know, object-based languages are all about messages, object references, and passing messages to object references. In some methods, like this one called “hasParent”, we are solely passing messages to objects; the method is merely a structure which composes some other objects. This is starting to be a lot of code; surely there’s a better way to express this composite?

#!/usr/bin/env -S execlineb -s1
importas -iS V
backtick -E parent { ${V}/call: $1 at: "parent*" }
${V}/call: $parent exists

Syntax

Let’s fragment the system a little bit by introducing an expression language just for this sort of composition. Our justification is that we aren’t going to actually replace execline; we’re just going to make it easier to write. We’ll scavenge some grammar from a few different flavors of Smalltalk. The idea is that our program above could be represented by something like:

[|^(self at: "parent*") exists]

For non-Smalltalkers, this is a block, a fundamental unit of code. The square brackets delimit the entire block. The portion to the right of the pipe “|” is a list of expressions; here there is only one. When the final expression starts with a caret “^”, it will become the answer or return value; there’s a designated Nil object that is answered by default. Expressions are merely messages passed to objects, with the object on the left and the message on the right. If a message verb ends with a colon “:” then it is called a keyword and labels an argument; for each verb with a colon there is a corresponding argument. The builtin name self refers to the current object.

The parentheses might seem odd at first! In Smalltalk, applications of verbs without arguments, so-called unary applications, bind tighter than keyword applications. If we did not parenthesize the example then we would end up with the inner call "parent*" exists, which is a unary application onto a string literal. We also must parenthesize to distinguish nested keyword applications, as in the following example:

[:source|
obj := (self at: "NixStore") intern: source.
^self at: obj name put: obj]

Here we can see the assignment token “:=” for creating local names. The full stop “.” occurs between expressions; it creates statements, which can either assign to a name or not. We can also see a parameter to this block, “:source”, which occurs on the left side of the pipe “|” and indicates that one argument can be passed along with any message.

Grammar

Okay, that’s enough of an introduction to Vixen’s expression language. How do we parse it? That’s where Raku comes in! (As Arlo Guthrie might point out, this is a blog post about Raku.) Our grammar features everything I’ve shown so far, as well as a few extra features like method cascading with the semicolon “;” for which I don’t have good example usage.

grammar Vixen {
    token id       { <[A..Za..z*]>+ <![:]> }
    token selector { <[A..Za..z*]>+ \: }
    token string   { \" <-["]>* \" }
    token param    { \: <id> }
    token ass      { \:\= }

    rule params { <param>* % <ws> }

    proto rule lit             {*}
          rule lit:sym<block>  { '[' <params> '|' <exprs> ']' }
          rule lit:sym<paren>  { '(' <call> ')' }
          rule lit:sym<string> { <string> }
          rule lit:sym<name>   { <id> }

    rule chain { <id>* % <ws> }

    proto rule unary {*}
          rule unary:sym<chain> { <lit> <chain> }

    rule keyword { <selector> <unary> }

    proto rule message {*}
          rule message:sym<key>  { <keyword>+ }

    rule messages { <message>* % ';' }

    proto rule call {*}
          rule call:sym<cascade> { <unary> <messages> }

    proto rule assign {*}
          rule assign:sym<name> { <id> <ass> <call> }
          rule assign:sym<call> { <call> }

    rule statement { <assign> '.' }

    proto rule exprs {*}
          rule exprs:sym<rv>  { <statement>* '^' <call> }
          rule exprs:sym<nil> { <statement>* <call> }

    rule TOP { '[' <params> '|' <exprs> ']' }
}
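For readers new to Raku grammars, two features do much of the heavy lifting above: proto rules with `sym<…>` candidates, and the `%` separator quantifier. Here is a minimal, self-contained sketch of both; the grammar and names below are illustrative and are not taken from the Vixen grammar.

```raku
# A toy grammar, not part of Vixen: `item` is a proto rule with two
# candidates, and `list` uses `%` to match comma-separated identifiers.
grammar Mini {
    token id  { <[A..Za..z]>+ }
    rule list { <id>* % ',' }

    proto rule item {*}
          rule item:sym<parens> { '(' <list> ')' }
          rule item:sym<bare>   { <id> }

    rule TOP { <item> }
}

say Mini.parse('(a, b, c)') ?? 'parsed' !! 'failed';  # parsed
say Mini.parse('x')         ?? 'parsed' !! 'failed';  # parsed
```

Adding another flavor of item is just another `rule item:sym<…>` line, which is much of what makes these grammars easy to refactor.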

Writing the grammar is mostly a matter of repeatedly giving it example strings. The one tool that I find indispensable is some sort of debugging tracer which indicates where a parse rule has failed. I used Grammar::Tracer, available via zef. I’m on NixOS, so language-specific package managers don’t always play nice, but zef works and is recommended. First I ran:

$ zef install Grammar::Tracer

And then I could start my file with a single import in order to get tracing:

use Grammar::Tracer;

Actions

The semantic actions transform the concrete syntax tree to abstract syntax. This sort of step is not present in classic META II but is essential for maintaining sanity. I’m going to use this grammar for more than a few weeks, so I wrote a few classes for representing abstract syntax and a class of actions. Some actions are purely about extraction; for example, the method for the params production merely extracts the list of matches and takes the Str of each.

    method params($/) { make $<param>.values.map: *.Str; }

Some actions contain optimizations that avoid building abstract syntax. The following method for unary handles chained messages, where we have multiple unary applications in a row; we want a special case for zero applications so that the VUnary class can assume that it always has at least one application.

    method unary:sym<chain>($/) {
        my $receiver = $<lit>.made;
        my @verb = $<chain>.made;
        make @verb ?? VUnary.new(:$receiver, :@verb) !! $receiver;
    }

Some actions build fresh abstract syntax not in the original program. The following method for exprs handles the case when there is no return caret; the final expression is upgraded to a statement which ignores its return value and the name Nil is constructed as the actual return value.

    method exprs:sym<nil>($/) {
        my @statements = $<statement>.values.map: *.made;
        my $call = $<call>.made;
        @statements.push: VIgnore.new(:$call);
        my $rv = VName.new(:n("Nil"));
        make VExprs.new(:@statements, :$rv);
    }

Getting the actions right was difficult. I ended up asking for hints on IRC about how to work with matches. The .values method is very useful.
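To see the moving parts outside of the Vixen compiler, here is a tiny self-contained grammar with an actions class: make stores a value on a match, .made retrieves it, and .values walks the submatches. Everything here is illustrative, not compiler code.

```raku
grammar Nums {
    rule TOP  { <num>+ % ',' }
    token num { \d+ }
}

class NumsActions {
    # Attach an Int to each `num` match ...
    method num($/) { make +$/ }
    # ... then fold the made values of the submatches together.
    method TOP($/) { make $<num>.values.map(*.made).sum }
}

say Nums.parse('1, 2, 3', :actions(NumsActions)).made;  # 6
```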

Abstract syntax

I had a couple false starts with the abstract syntax. I think that the right mentality is to have one node per production, but to have one role per compiler action. If necessary, change the grammar to make the abstract syntax easier to generate; Raku is flexible enough to allow grammars to be refactored. Rules like params, chain, and keyword were broken out to make life easier.

By the way, starting at this point, I am only showing excerpts from the complete compiler. The compiler is available in a separate gist. Classes may be incomplete; only relevant methods and attributes are shown.

For example, there is a role for emitting literals. A parenthesized call just unwraps the parentheses; a string is represented by itself.

role EmitLiteral {
    method prepareLiteral($compiler) { ... }
}
class VParen does EmitLiteral {
    has Call $.call;
    method prepareLiteral($compiler) { $.call.prepareLiteral: $compiler; }
}
class VStr does EmitLiteral {
    has Str $.s;
    method prepareLiteral($compiler) { $.s; }
}

We can also have a role for performing a call. We need two flavors of call: call and bind to a name, and also call without binding. It’s much easier to compile chains and cascades with the option to bind or not bind. We can put both roles onto a single class, so that a cascading application both performs a call and also evaluates to a literal expression.

role Call {
    method prepareBind($name, $compiler) { ... }
    method prepareOnly($compiler) { ... }
}
class VCall does Call does EmitLiteral {
    has EmitLiteral $.unary;
    has VMessage @.cascades;
    # NB: @cascades is inhabited!
    method prepareBind($name, $compiler) {
        my $receiver = $.unary.prepareLiteral: $compiler;
        my $last = @.cascades[*-1];
        for @.cascades[0 ...^ @.cascades.elems - 1] {
            my ($verb, @row) = $_.prepareMessage: $compiler;
            $compiler.callOnly: $receiver, $verb, @row;
        };
        my ($verb, @row) = $last.prepareMessage: $compiler;
        $compiler.callBind: $name, $receiver, $verb, @row;
    }
    method prepareOnly($compiler) {
        my $receiver = $.unary.prepareLiteral: $compiler;
        for @.cascades {
            my ($verb, @row) = $_.prepareMessage: $compiler;
            $compiler.callOnly: $receiver, $verb, @row;
        };
    }
    method prepareLiteral($compiler) {
        my $name = $compiler.gensym;
        self.prepareBind: $name, $compiler;
        "\$" ~ $name;
    }
}

A first compiler

We’ll start by compiling just one block. Our compiler will include a gensym: a method which can generate a symbol that hasn’t been used before. I’m not trying very hard here and it would be easy for a malicious user to access generated symbols; we can fix that later. The compiler is mostly going to store calls; each call can either be a backtick or an if (or foreground) depending on whether it binds a name.

class Compiler {
    has Int $!syms;
    method gensym { $!syms += 1; "gs" ~ $!syms; }

    has Str $!block = "TOP";  # name of the block currently being emitted
    has Str %.lines;
    method push($line) { %.lines{ $!block } ~= $line ~ "\n"; }

    method callBind($name, $receiver, $verb, @args) {
        self.push: "backtick -E $name \{ " ~ formatCall($receiver, $verb, @args) ~ " \}";
    }
    method callOnly($receiver, $verb, @args) {
        self.push: "if \{ " ~ formatCall($receiver, $verb, @args) ~ " \}";
    }

    method assignName($from, $to) { self.push: "define $to \$$from"; }
}

The method .assignName is needed to handle assignments without intermediate calls, as in this := that.

class VName does Call does EmitLiteral {
    has Str $.n;
    method prepareBind($name, $compiler) { $compiler.assignName: $.n, $name; }
    method prepareOnly($compiler) {;}
    method prepareLiteral($compiler) { "\$" ~ $.n; }
}

Calling into Vixen

To compile multiple blocks, we will need to emit multiple blocks. A reasonable approach might be to emit a JSON Object where each key is a block name and each value is a String containing the compiled block. I’m feeling more adventurous than that, though. Here’s a complete Smalltix/Vixen FFI:

sub callVixen($receiver, $verb, *@args) {
    my $proc = run %*ENV<V> ~ "/call:", $receiver, $verb, |@args, :out;
    my $answer = $proc.out.slurp: :close;
    $proc.sink;
    $answer.trim;
}

Vixen is merely a calling convention for processes; we can send a message to an object by doing some string formatting and running a subprocess. The response to a message, called an answer, is given by stdout. Non-zero return codes indicate failure and stderr will contain useful information for the user. The rest of the calling convention is handled by passing envp and calling the V/call: entrypoint.
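The run/:out/slurp/sink pattern in callVixen is ordinary Raku subprocess plumbing. The same shape can be exercised against a plain echo instead of a Vixen object; this sketch is not part of the compiler.

```raku
# Same pattern as callVixen, but against an ordinary executable.
sub call-process($cmd, *@args) {
    my $proc   = run $cmd, |@args, :out;   # capture stdout
    my $answer = $proc.out.slurp: :close;  # the "answer" is stdout
    $proc.sink;                            # throws if the exit code was non-zero
    $answer.trim;
}

say call-process('echo', 'hello');  # hello
```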

In addition to passing V in the environment, we will assume that there are Allocator and NixStore objects. Allocator allocates new objects and NixStore interacts with the Nix package manager; we will allocate a new object and store it in the Nix store. The relevant methods are V/clone: anAllocator, which allocates a shallow copy of V and serves as a blank object template, and NixStore/intern: anObject, which recursively copies an object from a temporary directory into the Nix store.

The reader doesn’t need to know much about Nix. The only relevant part is that the Nix store is a system-wide immutable directory that might not be enumerable; it’s a place to store packages, but it’s hard to alter packages or find a package that the user hasn’t been told about.

Name analysis

We will need to know whether a name is used by a nested block. When we create an object representing that block, we will provide that object with each name that it uses. This is called name-use analysis or just use analysis and it is a type of name analysis. The two effects worth noting are when an expression uses a name and when a statement assigns to a name. We track the used names with a Set[Str]. For example, a keyword message uses a name if any of its arguments use a name:

class VMessage {
    has VKeyword @.keywords;
    method uses(--> Set[Str]) { [(|)] @.keywords.map({ $_.uses }) }
}

A sequence of expressions has its usage computed backwards; every time an expression is assigned to a name, we let that assignment shadow any further uses by removing it from the set of used names. This can be written with reduce but it’s important to preserve readability since this sort of logic can be subtly buggy and often must be revisited during debugging.

class VExprs {
    has EmitStatement @.statements;
    has EmitLiteral $.rv;
    method uses(--> Set[Str]) {
        my $names = $.rv.uses;
        for @.statements.reverse {
            $names = ($names (-) $_.writes) (|) $_.call.uses;
        };
        $names;
    }
}

The .writes method merely produces the set of assigned names:

class VAssignName does EmitStatement {
    has Str $.target;
    method writes(--> Set[Str]) { Set[Str].new($.target) }
}

A second compiler

We now are ready to compile nested blocks. The overall workflow is to compute a closure for the inner block whose names are all used names in the block, except for parameters and global names. We rename everything in the closure with fresh symbols to avoid clashes and allow names like “self” to be closed over. We produce two scripts. One script accepts the closure’s values and attaches them to a new object; one script loads the closure and performs the action in the nested block upon the new object. We call into Vixen to allocate the prototype for the block, populate it, and intern it into the Nix store. Everything else is support code.

        my $closureNames = $uses (-) ($params (|) %globals);
        my %closure = $closureNames.keys.map: { $_ => $compiler.gensym ~ "_" ~ $_ };
        my $valueVerb = @.params ?? "value:" x @.params.elems !! "value";
        my $closureVerb = %closure ?? %closure.keys.map(* ~ ":").join !! "make";
        my $valueBlock = produceValueBlock($compiler, %closure, @.params, $.exprs);
        my $closureBlock = cloneForClosure($compiler, %closure);
        my $obj = callVixen(%*ENV<V>, "clone:", $allocator);
        $compiler.writeBlock: $obj, $valueVerb, $valueBlock;
        $compiler.writeBlock: $obj, $closureVerb, $closureBlock;
        my $interned = callVixen(%*ENV<NixStore>, "intern:", $obj);

One hunk of support code is in the generation of the scripts with produceValueBlock and cloneForClosure. These are open-coded actions against the $compiler object:

sub cloneForClosure($compiler, %closure) {
    my $name = $compiler.genblock;
    $compiler.pushBlock: $name, %closure.keys;
    my $obj = $compiler.gensym;
    my $selfName = $compiler.useName: "self";
    $compiler.callBind: $obj, $selfName, "clone:", ($allocator,);
    my $rv = $compiler.useName: $obj;
    for %closure.kv -> $k, $v {
        my $arg = $compiler.useName: $k;
        $compiler.foreground: "redirfd -w 1 $rv/$v echo " ~ $arg;
    }
    $compiler.finishBlock: $rv;
    $name;
}
sub produceValueBlock($compiler, %closure, @params, $exprs) {
    my $name = $compiler.genblock;
    $compiler.pushBlock: $name, @params;
    my $selfName = $compiler.useName: "self";
    for %closure.kv -> $k, $v { $compiler.callBind: $k, $selfName, $v, [] };
    my $rv = $exprs.compileExprs: $compiler;
    $compiler.finishBlock: $rv;
    $name;
}

The compiler was augmented with methods for managing scopes of names and reassigning names, so that the define helper is no longer used at all. There’s also a method .writeBlock which encapsulates the process of writing out a script to disk.

class Compiler {
    has Hash[Str] @.scopes;
    method assignName($from, $to) { @.scopes[*-1]{ $to } = $from }
    method useName($name) {
        for @.scopes.reverse {
            return "\$\{" ~ $_{ $name } ~ "\}" if $_{ $name }:exists;
        };
        die "Name $name not in scope!";
    }
    method writeBlock($obj, $verb, $blockName) {
        spurt "$obj/$verb", %.lines{ $blockName }.trim-leading;
        chmod 0o755, "$obj/$verb";
    }
}

Closing thoughts

This compiler is less jank than the typical compiler. There are a few hunks of duplicated code, but otherwise the logic is fairly clean and direct. Raku supports a clean compiler mostly by requiring a grammar and an action class; I had started out by writing imperative spaghetti actions, and it was up to me to decide to organize further. To optimize, it might be worth virtualizing assignments so that there is only one convention for calls; this requires further bookkeeping to not only track renames but also name usage. Indeed, at that point, the reader is invited to consider what SSA might look like. Another possible optimization is to skip allocating empty closures for blocks which don’t close over anything.

It was remarkably easy to call into Vixen from Raku. I could imagine using the FFI as scaffolding to incrementally migrate this compiler to a self-hosting expression-language representation of itself. I could also imagine extending the compiler with FFI plugins that decorate or cross-cut compiler actions.

This blogpost is currently over 400 lines. The full compiler is under 400 lines. We put Raku into a jar with Unix, Smalltalk, and Nix, shook it up, and let it sit for a few days. The result is a humble compiler for a simple expression language with a respectable amount of spice. Thanks to the Raku community for letting me share my thoughts. To all readers, whether you’re Pastafarian or not, whether you prefer red sauce or white sauce or even pesto, I hope that you have a lovely and peaceful Holiday.

Day 14 – Taming Concurrency

Hello everyone and a merry advent time!

Today I’d like to showcase a neat little mechanism that allows building concurrent applications without having to worry much about the concurrency. I’ve built this tool as part of the (yet to be released) Rakudo CI Bot. That’s a middleman that watches GitHub for things to test (both via API hooks that GitHub calls and via a regularly run poller), pushes test tasks to one or more CI backends (e.g. Azure), monitors those, pulls results once available, and reports the results back to GitHub. It has to deal with a lot of externally triggered, and thus naturally concurrent, events.

So there is a fundamental need to bring some synchronization into the processing of these events so things don’t step on each other’s toes. This is how I did it.

I’ve created components (simple singleton classes), each responsible for one of the incoming event sources. Those are:

  • The GitHubCITestRequester that receives CI test requests seen on GitHub, both via a poller and via a web hook that GitHub itself calls.
  • The OBS backend that pushes test tasks to the Open Build Service and monitors it for test results.
  • The CITestSetManager that connects the components. It reads test tasks from the DB and sends them off to the test backends. It also monitors the results and acts whenever all tests for a task complete.

Data is persisted in a DB. Each table and field lies in the responsibility of one of these components. So naturally, because the responsibility for each data unit is clear, the components are nicely separated. But the components often need to hand work over to one another. They do so by creating the respective DB entries (either new rows or filled-out fields) and then notifying the other component that data is waiting to be processed.

This notification needs to check a few boxes:

  • The processing should start as soon as possible once work is waiting.
  • There should always only be a single worker processing the work of a component. This simplifies the design a lot as I don’t have to synchronize workers.
  • The notifiers shouldn’t be blocked.

I modelled the above via a method called process-worklist on each of the above-mentioned component classes. That method looks at the DB, retrieves all rows that want to be processed, and does what needs to be done.

But this method can be called from multiple places and there is no guarantee that no worker is active when a call is done.

So I need a mechanism that ensures that

  • a call never blocks,
  • the method is run when it’s called and no run is in progress,
  • when a run is already in progress, queues another run right after the running one finishes,
  • but doesn’t pile up runs, because a single run will see and process all the work that was added in the mean time.

I achieved all of the above with a method trait called is serial-dedup. Add that trait to a method and it behaves exactly as described above. This is the code:

unit module SerialDedup;

my class SerialDedupData {
    has Semaphore $.sem is rw .= new(1);
    has $.run-queued    is rw = False;
}

my role SerialDedupStore {
    has SerialDedupData %.serial-dedup-store-variable;
}

multi sub trait_mod:<is>(Method $r, :$serial-dedup!) is export {
    my Lock $setup-lock .= new;
    $r.wrap(my method ($obj:) {
        my $d;
        $setup-lock.protect: {
            if !$obj.does(SerialDedupStore) {
                $obj does SerialDedupStore;
            }
            $d := $obj.serial-dedup-store-variable{$r.name} //= SerialDedupData.new;
        }

        if $d.sem.try_acquire() {
            my &next = nextcallee;
            $d.run-queued = False;
            start {
                &next($obj);
                $d.sem.release();
                $obj.&$r() if $d.run-queued;
            }
        }
        else {
            $d.run-queued = True;
        }
    });
}

It works by wrapping the method the trait was added to. State is persisted in a field injected into the object by doesing a Role holding that field. The rest of the code is a pretty straightforward implementation of the required semantics. It tracks whether a call is running via the $.sem semaphore, keeps a note of whether another run was requested (by a concurrent call to the method) in the $.run-queued boolean, and always runs the wrapped method in a separate thread.
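As a hypothetical usage sketch, here is the trait from above inlined into a single file and applied to a toy component; the Requester class and its counter are invented for the demo. Twenty rapid-fire notifications collapse into a small number of actual runs, and none of them block the caller:

```raku
# The SerialDedup machinery from above, inlined for a self-contained demo.
my class SerialDedupData {
    has Semaphore $.sem is rw .= new(1);
    has $.run-queued    is rw = False;
}
my role SerialDedupStore {
    has SerialDedupData %.serial-dedup-store-variable;
}
multi sub trait_mod:<is>(Method $r, :$serial-dedup!) {
    my Lock $setup-lock .= new;
    $r.wrap(my method ($obj:) {
        my $d;
        $setup-lock.protect: {
            $obj does SerialDedupStore unless $obj.does(SerialDedupStore);
            $d := $obj.serial-dedup-store-variable{$r.name} //= SerialDedupData.new;
        }
        if $d.sem.try_acquire() {
            my &next = nextcallee;
            $d.run-queued = False;
            start {
                &next($obj);
                $d.sem.release();
                $obj.&$r() if $d.run-queued;
            }
        }
        else {
            $d.run-queued = True;
        }
    });
}

# A toy component standing in for e.g. the GitHubCITestRequester.
class Requester {
    has atomicint $.runs = 0;
    method process-worklist() is serial-dedup {
        $!runs⚛++;
        sleep 0.05;   # pretend to chew through the DB worklist
    }
}

my $r = Requester.new;
$r.process-worklist for ^20;   # twenty notifications, none of them block
sleep 0.5;                     # give the background runs time to settle
say $r.runs, ' runs for 20 notifications';  # far fewer than 20 on a typical run
```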

All in all, I love how straightforward, easy, and short the code doing this is. Especially when considering that this mechanism is the heart of the concurrency handling of a non-trivial application.

Day 13 – Christmas Crunching Part II

Christmas is nearly on us, 10 shopping days to go. Things at the pole were gathering pace, with so much left to do.

Rudolph (him again) was pacing up and down, mashing his cheroot. He cast his mind back to the App::Crag calcs he had done last time – sure all the distances, times and speeds had worked out. But something was still missing. Could he be sure that all the prerequisites were finalised, that the crucial flight would be a success again this year?

Then click, his nose lit up like a lightbulb – what about the fuel? Would they have enough juice in the powerplant to maintain optimum power output throughout the entire long night?

Once again, he cracked open his laptop and his hooves whirred away on the keys.

Present Power

He realised that vertical work against the pull of Earth’s gravity would be the key factor: Santa would need his winch to lower and raise him back up every chimney on the planet…

Green Christmas

New this year was the installation of a large lithium-ion battery pack in the sleigh. Rudi wondered how many wind turbines would be needed in Lapland to charge up the battery, and what capacity of cabling would be required…

Rudolph nodded sagely and lit his pipe, it would be alright on the night after all.

Rudolph’s calm and cheery,
ready for the flight—
he’s puffing on his cherrywood pipe,
glowing in the night.

~librasteve


Credits

Some of the App::Crag features in play tonight were:

  • ?<some random LLM query>
  • ^<25 mph> – a standard crag unit
  • ?^<speed of a diving swallow in mph> – put them together to get units
  • 25km – a shortcut if you have simple SI prefixes and units
  • $answer = 42s – crag is just vanilla Raku with no strict applied

Check out the crag-of-the-day for more – but beware, this is kinda strangely addictive.

Day 12 – Mathematician’s Yahtzee

Santa was playing and losing yet another game of Yahtzee with Mrs. Claus and the elves when a thought occurred to him: why don’t you get points for rolling the first 5 numbers of the Fibonacci sequence (1, 1, 2, 3, 5)? For that matter, why isn’t there a mathematician’s version of Yahtzee where you get points for rolling all even numbers, all odd, etc.? After a bit of brainstorming, Santa came up with the following rules:

  • Rolling all odd numbers replaces 3 of a kind (because ‘odd’ has 3 letters).
  • Rolling all even numbers replaces 4 of a kind (because ‘even’ has 4 letters).
  • Rolling pi (a 3, 1, and 4) replaces a full house.
  • Rolling all prime numbers replaces a small straight.
  • Rolling the Fibonacci sequence (1, 1, 2, 3, 5) replaces a large straight.
  • A Yahtzee (all five of the same number) is still a Yahtzee.

Santa and the others tried an experimental game with these rules, but it was more difficult than they expected. Everyone was so used to identifying a small/large straight, full house, etc, that it was hard to break those habits and identify the new patterns. So, Santa decided to write a Raku program to help him practice.

To start with, Santa created a function that uses the roll method to simulate rolling an arbitrary number of dice, then called it with 5 dice and printed the output:

sub roll-dice($n-dice) {
    return (1..6).roll($n-dice);
}

my @dice = roll-dice(5);
print-summary(@dice);

The print-summary function will be defined later.

Next up is allowing the user to re-roll an arbitrary number of dice up to two times. The user is prompted to enter which numbers they want to re-roll, using spaces to separate each number. To stop, the user can simply press enter. Getting input from the user is done via the prompt function.

for 1..2 -> $reroll-count {
    my $answer = prompt("> Which numbers would you like to re-roll? ");
    last if !$answer;
    my @indices;
    for $answer.split(/\s+/) -> $number {
        my $index = @dice.first: $number, :k;
        if !$index.defined {
            note "Could not find number $number";
            exit 1;
        }
        @indices.push: $index;
        @dice[$index]:delete;
    }
    for @indices -> $index {
        @dice[$index] = roll-dice(1).first;
    }
    print-summary(@dice) if $reroll-count == 1;
}

print-summary(@dice);

A few notes on the above code:

  • Once a user’s input is received, it’s split on whitespace, looping over each number.
  • The first method finds the first die with that number, while the :k adverb makes first return the index of that number (rather than the number itself).
  • If the number isn’t found, note is used to print to standard error and then the program exits.
  • The index is tracked by pushing it onto an array, and then the die value is deleted.
  • Using the delete adverb on an array creates a hole at that index. That’s okay, though, because the next loop fills it in with a new roll. Note that the roll-dice function always returns a Seq, even when rolling a single die, so the first method is used to get the first (and only) value from the sequence.
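Those last two bullets can be seen in isolation; the dice values here are hard-coded for the illustration:

```raku
my @dice = 3, 1, 4, 1, 5;

my $index = @dice.first(1, :k);   # :k returns the index, not the value
say $index;                       # 1

@dice[$index]:delete;             # leaves a hole at that index
say @dice.elems;                  # 5 — the hole still counts

@dice[$index] = 6;                # refill the hole with a fresh roll
say @dice;                        # [3 6 4 1 5]
```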

At this point Santa is halfway done. He doesn’t just want to print out the dice that were rolled, though; he also wants to identify all the new rules the dice match. To start with, identifying a Yahtzee in an array of dice is simple: just count the number of unique values. If it’s one, it’s a Yahtzee:

@dice.unique.elems == 1

For a Fibonacci sequence, Santa uses a Bag, which is a collection that keeps track of duplicate values. Two bags can be compared with the eqv (equivalence) operator. (Santa learned the hard way you can’t use == here because it turns a Bag into a Numeric, which would be the number of elements in the Bag).

@dice.Bag eqv (1, 1, 2, 3, 5).Bag
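
The eqv-versus-== pitfall mentioned above can be demonstrated directly (a standalone sketch):

```raku
say (1, 1, 2).Bag eqv (2, 1, 1).Bag;   # True  – order-insensitive, counts must match
say (1, 1, 2).Bag eqv (1, 2, 2).Bag;   # False – different counts per value
say (1, 2, 3).Bag == (4, 5, 6).Bag;    # True! == numifies each Bag to its total size
```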

For pi, Santa brushes up on his rusty set theory. After some research, he makes use of the ⊂ (subset of) operator, which returns true if all the left-side elements are in the right-side elements (and the right side must have more elements, which it will since there are 5 dice).

(3, 1, 4) ⊂ @dice

All prime numbers are easily identified by combining the all method with is-prime. This creates an all Junction that runs is-prime on each value and collapses into a single boolean.

@dice.all.is-prime

Something similar can be done for all even numbers, this time using the %% (divisibility) operator, which returns true if the left side is evenly divisible by the right side.

@dice.all %% 2

For all odd numbers, none is used instead of all.

@dice.none %% 2
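
Taken together, the junction checks above can be tried on sample rolls (a standalone sketch; the dice values are illustrative):

```raku
my @dice = 2, 3, 5, 5, 3;
say so @dice.all.is-prime;   # True  – every die is prime
say so @dice.all  %% 2;      # False – 3 and 5 are odd
say so @dice.none %% 2;      # False – 2 is even

my @odd = 1, 3, 5, 3, 1;
say so @odd.none %% 2;       # True  – "none divisible by 2" means all odd
```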

Finally, putting it all together into a function results in this:

sub print-summary(@dice where *.elems == 5) {
    with my @summary {
        .push: 'yahtzee'   if @dice.unique.elems == 1;
        .push: 'fibonacci' if @dice.Bag eqv (1, 1, 2, 3, 5).Bag;
        .push: 'pi'        if (3, 1, 4) ⊂ @dice;
        .push: 'all-prime' if @dice.all.is-prime;
        .push: 'all-even'  if @dice.all %% 2;
        .push: 'all-odd'   if @dice.none %% 2;
    };
    my $output = @dice.join(' ');
    if ( @summary ) {
        $output ~= " (@summary.join(', '))";
    }
    say $output;
}

Note the signature to the function uses a type constraint (where *.elems == 5) to enforce that there are always 5 dice; an error is thrown if there are not 5 elements.
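
A minimal standalone sketch of the same constraint behavior (needs-five is a hypothetical stand-in for print-summary):

```raku
sub needs-five(@dice where *.elems == 5) { @dice.sum }

say needs-five([1, 2, 3, 4, 5]);   # 15
# passing the wrong number of dice throws a binding error,
# which try converts to Nil here:
say (try needs-five([1, 2, 3])) // 'constraint failed';
```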

Here’s a sample run of the program:

$ raku math-yahtzee.raku
3 3 1 6 2
> Which numbers would you like to re-roll? 3 6
4 3 1 5 2 (pi)
> Which numbers would you like to re-roll? 4
1 3 1 5 2 (fibonacci)

This is exactly what Santa wanted to help him practice Mathematician’s Yahtzee! He had a lot of fun writing the program, especially the print-summary function, and is especially impressed that the program did not need to use any modules. There’s plenty more that could be done to improve this program, but Santa will stop here.

Day 11 – Raku To The Stars

Datastar is a hypermedia systems library in the same tradition as htmx, Unpoly, Alpine AJAX, and Hotwire‘s Turbo. These libraries are generally Javascript/Typescript bundles that utilize the HTML standard to allow you to declaratively write AJAX calls or JS-powered CSS transitions as HTML tag attributes instead of hand-writing Javascript. @librasteve has been working on Air and the HARC stack which seeks to deeply integrate htmx into the Raku ecosystem, and so I highly recommend reading his posts to get a better understanding of why hypermedia has always been a compelling sell.

htmx in particular makes no prescription on handling browser-side, component-local state. Carson Gross, the creator of htmx, lists Alpine, his own project hyperscript, Web Component frameworks such as lit, and several others as options depending on the use case.

A Datastar Primer

Datastar does not take this approach; it aims to handle both browser-side state (using signals) and the server-side connectivity that htmx and Phoenix LiveView provide. The main differentiating factor for Datastar is that it automatically handles Server-Sent Events (SSE) and text/event-stream responses, making it really good for real-time applications. Datastar also allows you to return a regular text/html response just like htmx; it morphs the HTML fragment into the DOM using a forked version of the same DOM-morphing library htmx uses. Datastar also accepts an application/json response from the server, which it uses to patch signals (which are JSON objects), and a text/javascript response, which it uses to run custom Javascript in the browser.

Raku ❤ Datastar

Finally realizing I don’t have to write any React code in my side projects, I wrote a Datastar library in Raku, combining the two things I’ve recently taken a liking to.

Raku’s best on display

Multimethods and Multisubs

Let’s take a walk through the actual code and see how I’ve utilized Raku’s expressivity, starting with the use of multi subs and multi methods:

multi patch-signals(
    Str $signals, 
    Str :$event-id, 
    Bool :$only-if-missing, 
    Int :$retry-duration
) is export {
    ...
}

multi patch-signals(%signals, *%options) {
    samewith to-json(%signals, :!pretty), |%options;
}

The reason we use multi subs here is to allow for the possibility that the signals may be serialized into a Str before patch-signals is called. However, assuming an Associative is passed as the first argument, we just turn it into a Str and call the same function.
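
A reduced, hypothetical sketch of the same dispatch-and-samewith pattern (greet stands in for patch-signals):

```raku
multi greet(Str $name, Bool :$shout) {
    my $msg = "Hello, $name";
    $shout ?? $msg.uc !! $msg;
}
multi greet(%person, *%options) {
    # normalize to the Str candidate, forwarding any named options
    samewith %person<name>, |%options;
}

say greet('Alice');                    # Hello, Alice
say greet({ name => 'Bob' }, :shout);  # HELLO, BOB
```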

Metaoperators

Let’s also take a look at the body of patch-signals, starting with this:

my @signal-lines = 
    "{SIGNALS-DATALINE-LITERAL} " X~ $signals.split("\n");

I love this line. Let’s start with what this code does; it prepends the word signal to every line of actual stringified JSON. So for example if we have this as our sample stringified JSON:

{
    "selected-element-index": 0,
    "selected-element-value": "Nominative"
}

The actual output of the first line will be the following, split per line:

signal {
signal     "selected-element-index": 0,
signal     "selected-element-value": "Nominative"
signal } 

which is what Datastar expects when parsing signals being sent downstream from the server. We’re using X~ the same way Bruce Gray outlines it here, as a scaling/map operator to prepend "signal" to each line of the stringified JSON, an incredibly clever use of the cross-product meta-operator.
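
The same X~ trick in miniature (a standalone sketch with a tiny hand-built JSON string):

```raku
my $json  = ('{', '  "a": 1', '}').join("\n");
# X~ crosses the single left-hand string with every line on the right,
# concatenating each pair:
my @lines = 'signal ' X~ $json.split("\n");
.say for @lines;
# signal {
# signal   "a": 1
# signal }
```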

Another meta-operator we use is the reduction metaoperator [], mainly to join the lines of the resulting SSE response into one string:

class SseGen {
    method Str { [~] @!data-lines }
}

DSL Building using dynamic variables and blocks

Let’s examine the first statement of patch-signals:

fail 'You can only call this method inside a datastar { } block' 
    unless $*INSIDE-DATASTAR-RESPONSE-GENERATOR;

We use dynamic variables to help enforce the usage of these functions within the datastar block as shown here:

datastar {
    patch-signals { current-index => 0 };
}

sub datastar is defined as:

sub datastar(&f) is export {
    my $*INSIDE-DATASTAR-RESPONSE-GENERATOR = True;
    my $*render-as-supply = False;
    my SseGen $*response-generator .= new;

    f();

    $*render-as-supply  
      ?? $*response-generator.Supply 
      !! $*response-generator.Str
}

We make use of blocks and dynamic variables here so that an implicit variable is passed down to the functions that use it, giving very easy-to-read code. This is largely inspired by what I saw in Matthew Stuckwisch’s presentation here.
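
A minimal standalone sketch of the dynamic-variable technique ($*INSIDE-WRAPPER, helper, and wrapper are illustrative names):

```raku
sub helper {
    # a dynamic variable declared anywhere up the call chain is visible here
    die 'helper must run inside wrapper' unless try $*INSIDE-WRAPPER;
    'ok'
}
sub wrapper(&f) {
    my $*INSIDE-WRAPPER = True;   # in scope for everything f() calls
    f();
}

say wrapper({ helper });   # ok
```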

A Few More Ideas

Another improvement I had in mind was modifying the multi functions so that you could theoretically write code in a functional manner using the feed operators like this if you wanted:

SseResponse.new
    ==> patch-elements("<div></div>")
    ==> patch-signals({ a => 2, b => 3 })
    ==> as-supply();

You would theoretically import this functionality by adding the following to the top of your file: use DataStar :functional;

Another idea that I had was the following:

given SseGen.new {
    .patch-elements: "<div>Hello there</div>";
    .patch-signals: { a => 2, b => 3 };

    .Supply
}

This makes use of the given control structure, which assigns an expression to the topic variable $_, and Raku’s syntactic sugar for calling methods on $_: simply invoke the method without a receiver/object.
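
A tiny standalone illustration of given and the receiver-less method calls:

```raku
given [3, 1, 2] {
    .push: 4;          # same as $_.push(4)
    say .sort.list;    # (1 2 3 4)
}
```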

I hope to make it easier for developers using this library to follow the TMTOWTDI philosophy by providing multiple strategies for interfacing with the library.

Next Steps

Here are the next steps regarding this library:

  • Run the full Datastar test suite against this library to make sure I’m on the same page.
  • Rename DataStar to Datastar. Datastar is the preferred name for the package and I incorrectly named it when I was in a rush to release this package.
  • Integrate this with Cro through a new package Cro::Datastar and integrate it with Air via a new package: Air::Jetstream.

Happy holidays everyone!

Day 10 – Santa’s Finance Department

Cron Meets Raku

The Finance Department’s computers had been converted to Debian Linux with Rakuized software along with all the other departments at the North Pole headquarters, and its employees enjoyed the Windows-free environment. However, an inspection by a firm hired to evaluate efficiency practices found some room for improvement.

Much of the work day was spent perusing email (as well as some postal mail) and entering purchase and payment information into the accounting software. The review team suggested that more automation was possible by investing in programs to (1) extract the information from the emails and (2) use optical character recognition (OCR) on digital scans of paper invoices.

The review team briefed Santa Claus and his department heads after their work was finished. After the team departed, Santa asked the IT department to assist the finance department in making the improvements.

Note the IT department is now using ChatGPT as a programming aid, so some of the new projects rely on it heavily for assistance in areas of little expertise as well as for handling boilerplate (i.e., boring) coding. But any code used is tested thoroughly.

Extracting data from email

Gmail is the email system used currently with an address of “finance.santa@gmail.com” for the department. All bills and correspondence with external vendors use that address.

Normally Raku would be the programming language of choice, but Python is used for the interaction with Gmail because Google provides well-defined, well-supported Python APIs.

In order to access Gmail programmatically, we need a secret token for our user. Following is the one-time interactive process using Python:

cd /path/to/gmail-finance-ingest # a directory to contain most Python code
python3 -m venv .venv
. .venv/bin/activate
pip install .
gmail-finance-ingest init-auth \
  --credentials=/home/finance/secret/google/credentials.json \
  --token=/home/finance/secret/google/token.json

That launches the browser, the user approves access, and the token is saved. After that, no more interaction is needed; cron (the Linux scheduler) can use the same token.

In order to handle the mail appropriately, we use a YAML file to identify expected mail and its associated data, as shown in this example file config.yml:

data_root: /home/finance/gmail-bills

sources:
  - name: city-utilities
    gmail_query: 'from:(billing@mycity.gov) has:attachment filename:pdf'
    expect: pdf
    subdir: city-utilities

  - name: electric-utility
    gmail_query: 'from:(noreply@powerco.com) subject:(Your bill) has:attachment filename:pdf'
    expect: pdf
    subdir: electric-utility

  - name: amazon
    gmail_query: 'from:(order-update@amazon.com OR auto-confirm@amazon.com)'
    expect: email
    subdir: amazon

Following is the bash script to handle the finance department’s config file:

. /home/finance/path/to/gmail-finance-ingest/.venv/bin/activate
gmail-finance-ingest sync \
  --config=/home/finance/gmail-finance-config.yml \
  --credentials=/home/finance/secret/google/credentials.json \
  --token=/home/finance/secret/google/token.json

Automating the process

Linux cron is used to automate various email queries, thus saving a lot of manual, boring work by staff.

cron is a time-based job scheduler in Linux and other Unix-like operating systems. It enables users to schedule commands or scripts (known as cron jobs) to run automatically at specific times, dates, or intervals.

The driver program is a Python package named gmail_finance_ingest.

Here is the bash script used to operate on emails:

#!/bin/bash
set -e

LOGDIR="$HOME/log"
mkdir -p "$LOGDIR"

# Activate venv
. "$HOME/path/to/gmail-finance-ingest/.venv/bin/activate"

gmail-finance-ingest sync \
  --config="$HOME/gmail-finance-config.yml" \
  --credentials="$HOME/secret/google/credentials.json" \
  --token="$HOME/secret/google/token.json" \
  >> "$LOGDIR/gmail-sync.log" 2>&1

Following is the cron code used to update email scans daily:

15 3 * * * /home/finance/bin/run-gmail-sync.sh
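
For readers unfamiliar with crontab syntax, the five time fields of that entry read as follows (an annotated copy of the same line):

```
# ┌ minute (15)
# │  ┌ hour (3)
# │  │ ┌ day of month (* = every)
# │  │ │ ┌ month (* = every)
# │  │ │ │ ┌ day of week (* = every)
15 3 * * * /home/finance/bin/run-gmail-sync.sh
```

So the sync runs daily at 03:15.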

For processing, we handle several types of data found in the expected emails, identified in the config file by the keywords shown below:

  1. text embedded in the mail – expect=email
  2. PDF attachments – expect=pdf
  3. attachments or enclosed chunks of scanned documents – expect=ocr

Type 3 is not yet handled.

The collected data is parsed by type and the pertinent output is placed in CSV tables for bookkeeping purposes. Such tables can be used as source for Linux accounting programs like GnuCash. The department has been using that program since the big Linux/Debian transition.

Emails which cannot be evaluated by machine are automatically forwarded to designated department staff to handle manually.

Other work

The IT folks have other projects not formally published yet, but some are in final testing stage and are usable now. See the summary below for related Raku projects such as an Access-like database program and a check-writing program.

Summary

The products mentioned above are still works-in-progress, but their development can be followed on GitHub now at:

An automated email interrogator for known senders.

A program to print a business-size check on a standard single-check form available from ?

An Access-like relational database management system capable of using CSV tables as a backing store.

Epilogue

Don’t forget the “reason for the season:” ✝

As I always end these jottings, in the words of Charles Dickens’ Tiny Tim, “may God bless Us, Every one!” [2]

Footnotes

A Christmas Carol, a short story by Charles Dickens (1812-1870), a well-known and popular Victorian author whose many works include The Pickwick Papers, Oliver Twist, David Copperfield, Bleak House, Great Expectations, and A Tale of Two Cities.

Day 9 – Monadic programming examples

Introduction

This document (notebook) provides examples of monadic pipelines for computational workflows in Raku. It expands on the blog post “Monad laws in Raku”, [AA2], (notebook), by including practical, real-life examples.

Context

As mentioned in [AA2], here is a list of the applications of monadic programming we consider:

  1. Graceful failure handling
  2. Rapid specification of computational workflows
  3. Algebraic structure of written code

Remark: Those applications are discussed in [AAv5] (and its future Raku version.)

As a tools maker for Data Science (DS) and Machine Learning (ML), [AA3],
I am very interested in Point 1; but as a “simple data scientist” I am mostly interested in Point 2.

That said, a large part of my Raku programming has been dedicated to rapid and reliable code generation for DS and ML by leveraging the algebraic structure of corresponding software monads, i.e. Point 3. (See [AAv2, AAv3, AAv4].) For me, first and foremost, monadic programming pipelines are just convenient interfaces to computational workflows. Often I make software packages that allow “easy”, linear workflows that can have very involved computational steps and multiple tuning options.

Dictionary

  • Monadic programming
    A method for organizing computations as a series of steps, where each step generates a value along with additional information about the computation, such as possible failures, non-determinism, or side effects. See [Wk1].
  • Monadic pipeline
    Chaining of operations with a certain syntax. Monad laws apply loosely (or strongly) to that chaining.
  • Uniform Function Call Syntax (UFCS)
    A feature that allows both free functions and member functions to be called using the same object.function() method call syntax.
  • Method-like call
    Same as UFCS. A Raku example: [3, 4, 5].&f1.$f2.
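
For example, the method-like call from the dictionary entry above can be made concrete (a standalone sketch with hypothetical f1 and f2):

```raku
sub f1(@xs) { @xs.map(* + 1) }   # a free function
my $f2 = { .sum };               # a Callable held in a scalar

# .&f1 calls f1 with the invocant as argument; .$f2 does the same for $f2
say [3, 4, 5].&f1.$f2;           # 15, i.e. $f2(f1([3, 4, 5]))
```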

Setup

Here are loaded packages used in this document (notebook):

use Data::Reshapers;
use Data::TypeSystem;
use Data::Translators;

use DSL::Translators;
use DSL::Examples;

use ML::SparseMatrixRecommender;
use ML::TriesWithFrequencies;

use Hilite::Simple;


Prefix trees

Here is a list of steps:

  • Make a prefix tree (trie) with frequencies by splitting words into characters over @words1
  • Merge the trie with another trie made over @words2
  • Convert the node frequencies into probabilities
  • Shrink the trie (i.e. find the “prefixes”)
  • Show the tree-form of the trie

Let us make a small trie of pet names (used by Raku or Perl fans):

my @words1 = random-pet-name(*)».lc.grep(/ ^ perl /);
my @words2 = random-pet-name(*)».lc.grep(/ ^ [ra [k|c] | camel ] /);

Here we make a trie (prefix tree) for those pet names using the feed operator and the functions of “ML::TriesWithFrequencies”, [AAp5]:

@words1 ==> 
trie-create-by-split() ==>
trie-merge(@words2.&trie-create-by-split) ==>
trie-node-probabilities() ==>
trie-shrink() ==>
trie-say

TRIEROOT => 1
├─camel => 0.10526315789473684
│ ├─ia => 0.5
│ └─o => 0.5
├─perl => 0.2631578947368421
│ ├─a => 0.2
│ ├─e => 0.2
│ └─ita => 0.2
└─ra => 0.631578947368421
  ├─c => 0.75
  │ ├─er => 0.2222222222222222
  │ ├─he => 0.5555555555555556
  │ │ ├─al => 0.2
  │ │ └─l => 0.8
  │ │   └─  => 0.5
  │ │     ├─(ray ray) => 0.5
  │ │     └─ray => 0.5
  │ ├─ie => 0.1111111111111111
  │ └─ket => 0.1111111111111111
  └─k => 0.25
    ├─i => 0.3333333333333333
    └─sha => 0.6666666666666666

Using andthen and the Trie class methods (but skipping node-probabilities calculation in order to see the counts):

@words1
andthen .&trie-create-by-split
andthen .merge( @words2.&trie-create-by-split )
# andthen .node-probabilities
andthen .shrink
andthen .form

TRIEROOT => 19
├─camel => 2
│ ├─ia => 1
│ └─o => 1
├─perl => 5
│ ├─a => 1
│ ├─e => 1
│ └─ita => 1
└─ra => 12
  ├─c => 9
  │ ├─er => 2
  │ ├─he => 5
  │ │ ├─al => 1
  │ │ └─l => 4
  │ │   └─  => 2
  │ │     ├─(ray ray) => 1
  │ │     └─ray => 1
  │ ├─ie => 1
  │ └─ket => 1
  └─k => 3
    ├─i => 1
    └─sha => 2


Data wrangling

One appealing way to show that monadic pipelines result in clean and readable code is to demonstrate their use in Raku through data wrangling operations. (See the “data packages” loaded above.) Here we get the Titanic dataset, show its structure, and show a sample of its rows:

#% html
my @dsTitanic = get-titanic-dataset();
my @field-names = <id passengerClass passengerSex passengerAge passengerSurvival>;

say deduce-type(@dsTitanic);

@dsTitanic.pick(6) 
==> to-html(:@field-names)

Vector(Assoc(Atom((Str)), Atom((Str)), 5), 1309)

id    passengerClass  passengerSex  passengerAge  passengerSurvival
960   3rd             male          30            died
183   1st             female        30            survived
1043  3rd             female        -1            survived
165   1st             male          40            survived
891   3rd             male          20            died
806   3rd             male          -1            survived

Here is an andthen data wrangling monadic pipeline, the lines of which have the following interpretations:

  • Initial pipeline value (the dataset)
  • Rename columns
  • Filter rows (with age greater or equal to 10)
  • Group by the values of the columns “sex” and “survival”
  • Show the structure of the pipeline value
  • Give the sizes of each group as a result
@dsTitanic 
andthen rename-columns($_,  {passengerAge => 'age', passengerSex => 'sex', passengerSurvival => 'survival'})
andthen $_.grep(*<age> ≥ 10).List
andthen group-by($_, <sex survival>)
andthen {say "Dataset type: ", deduce-type($_); $_}($_)
andthen $_».elems

Dataset type: Struct([female.died, female.survived, male.died, male.survived], [Array, Array, Array, Array])


{female.died => 88, female.survived => 272, male.died => 512, male.survived => 118}

Remark: The andthen pipeline corresponds to the R pipeline in the next section.

A similar result can be obtained via cross-tabulation, using a pipeline with the feed (==>) operator:

@dsTitanic
==> { .grep(*<passengerAge> ≥ 10) }()
==> { cross-tabulate($_, 'passengerSex', 'passengerSurvival') }()
==> to-pretty-table()

+--------+----------+------+
|        | survived | died |
+--------+----------+------+
| female |   272    |  88  |
| male   |   118    | 512  |
+--------+----------+------+

Tries with frequencies can also be used for finding this kind of (deep) contingency tensor (not just shallow tables):

@dsTitanic
andthen $_.map(*<passengerSurvival passengerSex passengerClass>)
andthen .&trie-create
andthen .form

TRIEROOT => 1309
├─died => 809
│ ├─female => 127
│ │ ├─1st => 5
│ │ ├─2nd => 12
│ │ └─3rd => 110
│ └─male => 682
│   ├─1st => 118
│   ├─2nd => 146
│   └─3rd => 418
└─survived => 500
  ├─female => 339
  │ ├─1st => 139
  │ ├─2nd => 94
  │ └─3rd => 106
  └─male => 161
    ├─1st => 61
    ├─2nd => 25
    └─3rd => 75

Remark: This application of Tries with frequencies can be leveraged in making mosaic plots. (See this MosaicPlot implementation in Wolfram Language, [AAp8].)


Data wrangling code with multiple languages and packages

Let us demonstrate the rapid specification of workflows application by generating data wrangling code from natural language commands. Here is a natural language workflow spec (each row corresponds to a pipeline segment):

my $commands = q:to/END/;
use dataset dfTitanic;
rename columns passengerAge as age, passengerSex as sex, passengerClass as class;
filter by age ≥ 10;
group by 'class' and 'sex';
counts;
END

Grammar based interpreters

Here is a table with the generated codes for different programming languages according to the spec above (using “DSL::English::DataQueryWorkflows”, [AAp3]):

#% html
my @tbl = <Python R Raku WL>.map({ %( language => $_, code => ToDSLCode($commands, format=>'code', target => $_) ) });
to-html(@tbl, field-names => <language code>, align => 'left').subst("\n", '<br>', :g)

Executing the Raku pipeline (by replacing dfTitanic with @dsTitanic first):

my $obj = @dsTitanic;
$obj = rename-columns( $obj, %("passengerAge" => "age", "passengerSex" => "sex", "passengerClass" => "class") ) ;
$obj = $obj.grep({ $_{"age"} >= 10 }).Array ;
$obj = group-by($obj, ("class", "sex")) ;
$obj = $obj>>.elems

{1st.female => 132, 1st.male => 149, 2nd.female => 96, 2nd.male => 149, 3rd.female => 132, 3rd.male => 332}

That is not monadic, of course — see the monadic version above.


LLM generated (via DSL examples)

Here we define an LLM-examples function for translation of natural language commands into code using DSL examples (provided by “DSL::Examples”, [AAp6]):

my sub llm-pipeline-segment($lang, $workflow-name = 'DataReshaping') { llm-example-function(dsl-examples(){$lang}{$workflow-name}) };

Here is the LLM translated code:

my $code = llm-pipeline-segment('Raku', 'DataReshaping')($commands)

use Data::Reshapers; use Data::Summarizers; use Data::TypeSystem
my $obj = @dfTitanic;
$obj = rename-columns($obj, %(passengerAge => 'age', passengerSex => 'sex', passengerClass => 'class'));
$obj = $obj.grep({ $_{'age'} >= 10 }).Array;
$obj = group-by($obj, ('class', 'sex'));
$obj = $obj>>.elems;

Here the translated code is turned into monadic code by string manipulation:

my $code-mon = $code.subst(/ $<lhs>=('$' \w+) \h+ '=' \h+ (\S*)? $<lhs> (<-[;]>*) ';'/, {"==> \{{$0}\$_{$1} \}()"} ):g;
$code-mon .= subst(/ $<printer>=[note|say] \h* $<lhs>=('$' \w+) ['>>'|»] '.elems' /, {"==> \{$<printer> \$_>>.elems\}()"}):g;

use Data::Reshapers; use Data::Summarizers; use Data::TypeSystem
my $obj = @dfTitanic;
==> {rename-columns($_, %(passengerAge => 'age', passengerSex => 'sex', passengerClass => 'class')) }()
==> {$_.grep({ $_{'age'} >= 10 }).Array }()
==> {group-by($_, ('class', 'sex')) }()
==> {$_>>.elems }()

Remark: It is believed that the string manipulation shown above provides insight into how and why monadic pipelines make imperative code simpler.


Recommendation pipeline

Here is a computational specification for creating a recommender and obtaining a profile recommendation:

my $spec = q:to/END/;
create from @dsTitanic; 
apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsTitanic on "id";
echo the pipeline value;
END

Here is the Raku code for that spec, given as an HTML code snippet with code highlights:

#%html
ToDSLCode($spec, default-targets-spec => 'Raku', format => 'code')
andthen .subst('.', "\n.", :g)
andthen hilite($_)

my $obj = ML::SparseMatrixRecommender
.new
.create-from-wide-form(@dsTitanic)
.apply-term-weight-functions(global-weight-func => "IDF", local-weight-func => "None", normalizer-func => "Cosine")
.recommend-by-profile(["passengerSex:male", "passengerClass:1st"])
.join-across(@dsTitanic, on => "id" )
.echo-value()

Here we execute a slightly modified version of the pipeline (based on “ML::SparseMatrixRecommender”, [AAp7]):

my $obj = ML::SparseMatrixRecommender.new
.create-from-wide-form(@dsTitanic)
.apply-term-weight-functions("IDF", "None", "Cosine")
.recommend-by-profile(["passengerSex:male", "passengerClass:1st"])
.join-across(@dsTitanic, on => "id" )
.echo-value(as => {to-pretty-table($_, )} )

+----------------+-----+--------------+-------------------+----------+--------------+
| passengerClass |  id | passengerAge | passengerSurvival |  score   | passengerSex |
+----------------+-----+--------------+-------------------+----------+--------------+
|      1st       |  10 |      70      |        died       | 1.000000 |     male     |
|      1st       | 101 |      50      |      survived     | 1.000000 |     male     |
|      1st       | 102 |      40      |        died       | 1.000000 |     male     |
|      1st       | 107 |      -1      |        died       | 1.000000 |     male     |
|      1st       |  11 |      50      |        died       | 1.000000 |     male     |
|      1st       | 110 |      40      |      survived     | 1.000000 |     male     |
|      1st       | 111 |      30      |        died       | 1.000000 |     male     |
|      1st       | 115 |      20      |        died       | 1.000000 |     male     |
|      1st       | 116 |      60      |        died       | 1.000000 |     male     |
|      1st       | 119 |      -1      |        died       | 1.000000 |     male     |
|      1st       | 120 |      50      |      survived     | 1.000000 |     male     |
|      1st       | 121 |      40      |      survived     | 1.000000 |     male     |
+----------------+-----+--------------+-------------------+----------+--------------+


Functional parsers (multi-operation pipelines)

It can be said that the package “FunctionalParsers”, [AAp4], implements multi-operator monadic pipelines for the creation of parsers and interpreters. “FunctionalParsers” achieves that by implementing special infix operators.

use FunctionalParsers :ALL;
my &p1 = {1} ⨀ symbol('one');
my &p2 = {2} ⨀ symbol('two');
my &p3 = {3} ⨀ symbol('three');
my &p4 = {4} ⨀ symbol('four');
my &pH = {10**2} ⨀ symbol('hundred');
my &pT = {10**3} ⨀ symbol('thousand');
my &pM = {10**6} ⨀ symbol('million');
sink my &pNoun = symbol('things') ⨁ symbol('objects');

Here is a parser — all three monad operations (⨁, ⨂, ⨀) are used:

# Parse sentences that have (1) a digit part, (2) a multiplier part, and (3) a noun
my &p = (&p1 ⨁ &p2 ⨁ &p3 ⨁ &p4) ⨂ (&pT ⨁ &pH ⨁ &pM) ⨂ &pNoun;

# Interpreter:
# (1) flatten the parsed elements
# (2) multiply the first two elements and make a sentence with the third element
sink &p = { "{$_[0] * $_[1]} $_[2]"} ⨀ {.flat} ⨀ &p 

Here the parser is applied to different sentences:

['three million things', 'one hundred objects', 'five thousand things']
andthen .map({ &p($_.words.List).head.tail })
andthen (.say for |$_)

3000000 things
100 objects
Nil

The last sentence is not parsed because the parser &p knows only the digits from 1 to 4.


References

Articles, blog posts

[Wk1] Wikipedia entry: Monad (functional programming), URL: https://en.wikipedia.org/wiki/Monad_(functional_programming) .

[Wk2] Wikipedia entry: Monad transformer, URL: https://en.wikipedia.org/wiki/Monad_transformer .

[H1] Haskell.org article: Monad laws, URL: https://wiki.haskell.org/Monad_laws.

[SH2] Sheng Liang, Paul Hudak, Mark Jones, “Monad transformers and modular interpreters”, (1995), Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages. New York, NY: ACM. pp. 333–343. doi:10.1145/199448.199528.

[PW1] Philip Wadler, “The essence of functional programming”, (1992), 19’th Annual Symposium on Principles of Programming Languages, Albuquerque, New Mexico, January 1992.

[RW1] Hadley Wickham et al., dplyr: A Grammar of Data Manipulation, (2014), tidyverse at GitHub, URL: https://github.com/tidyverse/dplyr .
(See also, http://dplyr.tidyverse.org .)

[AA1] Anton Antonov, “Monad code generation and extension”, (2017), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Monad laws in Raku”, (2025), RakuForPrediction at WordPress.

[AA3] Anton Antonov, “Day 2 – Doing Data Science with Raku”, (2025), Raku Advent Calendar at WordPress.

Packages, paclets

[AAp1] Anton Antonov, MonadMakers, Wolfram Language paclet, (2023), Wolfram Language Paclet Repository.

[AAp2] Anton Antonov, StatStateMonadCodeGeneratoreNon, R package, (2019-2024),
GitHub/@antononcube.

[AAp3] Anton Antonov, DSL::English::DataQueryWorkflows, Raku package, (2020-2024),
GitHub/@antononcube.

[AAp5] Anton Antonov, ML::TriesWithFrequencies, Raku package, (2021-2024),
GitHub/@antononcube.

[AAp6] Anton Antonov, DSL::Examples, Raku package, (2024-2025),
GitHub/@antononcube.

[AAp7] Anton Antonov, ML::SparseMatrixRecommender, Raku package, (2025),
GitHub/@antononcube.

[AAp8] Anton Antonov, MosaicPlot, Wolfram Language paclet, (2023), Wolfram Language Paclet Repository.

Videos

[AAv1] Anton Antonov, Monadic Programming: With Application to Data Analysis, Machine Learning and Language Processing, (2017), Wolfram Technology Conference 2017 presentation. YouTube/WolframResearch.

[AAv2] Anton Antonov, Raku for Prediction, (2021), The Raku Conference 2021.

[AAv3] Anton Antonov, Simplified Machine Learning Workflows Overview, (2022), Wolfram Technology Conference 2022 presentation. YouTube/WolframResearch.

[AAv4] Anton Antonov, Simplified Machine Learning Workflows Overview (Raku-centric), (2022), Wolfram Technology Conference 2022 presentation. YouTube/@AAA4prediction.

[AAv5] Anton Antonov, Applications of Monadic Programming, Part 1, Questions & Answers, (2025), YouTube/@AAA4prediction.

Day 8 – HARC The Herald Angels Sing

Rudolph had long wanted to write a website – he longed to share his hobbies and opinions with all the children, so that they wouldn’t just think of him as a first class pilot and navigator. He knew about Raku and he had skim-read some information about Cro and Humming-Bird. But, being quite lazy, he wanted something very, very easy that he could use to whip up a site in a few lines.

He had overheard Dasher and Vixen talking behind the bike shed about a new Raku web authoring tool – HARC – and that sounded more in tune with his thinking.

HARC! the herald angels sing,
“Glory to the newborn King:
peace on earth, and mercy mild,
God and sinners reconciled!”

First Footing

He gave it a go, following the Getting Started info in the Raku Air::Examples module, for Air is the glue that puts the A in HARC.

zef install --/test cro Air
git clone https://github.com/librasteve/Air-Examples.git
cd Air-Examples/bin
raku 00-nano.raku

He pointed a browser at localhost:3000 and his nose lit up!

Pawing the Snow

He made a copy of 00-nano.raku and renamed it 20-rudolph.raku, then he added his name in the obvious place:

#!/usr/bin/env raku

use Air::Functional :BASE;
use Air::Base;

my $nano =
    site
        page
            main
                p "Yo rudi!";

$nano.serve;

His hooves typed raku 20-rudolph.raku and here’s what he saw in his browser:

Oh my sweet Santa Claus, that’s a whole webpage (a whole website, actually) in 5 lines of code. That’s going to save a lot of effort and bring back the -Ofun to web development.

Editor’s Note: Rudolph feels that HTML template systems such as Cro templates, Template::Mustache, or Template6 (there are many more on raku.land) are a very good idea for a big project with many unskilled young elves who can update web templates with little knowledge of real coding languages. However, this does not apply to an experienced reindeer like him, who wants all the power of a fully featured programming language and to avoid faffing around with all those angle brackets.

Walking On

His hooves began to clack away on the keyboard:

#!/usr/bin/env raku

use Air::Functional :BASE;
use Air::Base;

my $rudi = site page main [
    section [
        h2 'About Me';
        p 'Hello! I\'m Rudolph, a curious builder who loves working on small tools, playful experiments, and simple things that make life easier. I enjoy long walks, warm drinks, and the feeling of figuring something out after staring at it way too long.';
    ];
    section [
        h2 'Projects';
        ul [
            li [ strong 'ChimeBox:'; ' a tiny notification app that whispers instead of buzzes.' ];
            li [ strong 'TrailMapper:'; ' a map tool for discovering quiet paths around my city.' ];
            li [ strong 'CookieCrunch:'; ' a deliberately pointless game about collecting virtual cookies.' ];
        ];
    ];
    section [
        h2 'Contact';
        p 'If you\'d like to say hello, send a message via ', em 'rudolph@example.com';
    ];
];

$rudi.serve;

Hmmm – a neat way to set out the content right there in functional-style Raku source (using Air::Functional).

All the content is done, but the style is a bit so-so…

A Rising Trot

Let’s change those section tags to article tags (for our Rudi has checked the Pico CSS preset semantic tags) and add a splash of colour in the footer:

#!/usr/bin/env raku

use Air::Functional :BASE;
use Air::Base;

my $rudi = site page [
    header [
        h1 'Rudolph';
        p 'Developer • Tinkerer • Occasional Cookie Enthusiast';
    ];
    main [
        article [
            h2 'About Me';
            p 'Hello! I\'m Rudolph, a curious builder who loves working on small tools, playful experiments, and simple things that make life easier. I enjoy long walks, warm drinks, and the feeling of figuring something out after staring at it way too long.';
        ];
        article [
            h2 'Projects';
            ul [
                li [ strong 'ChimeBox:'; ' a tiny notification app that whispers instead of buzzes.' ];
                li [ strong 'TrailMapper:'; ' a map tool for discovering quiet paths around my city.' ];
                li [ strong 'CookieCrunch:'; ' a deliberately pointless game about collecting virtual cookies.' ];
            ];
        ];
        article [
            h2 'Contact';
            p 'If you\'d like to say hello, send a message via ', b 'rudolph@example.com';
        ];
    ];
    footer
        p [ safe '&copy; 2025 '; b 'Rudolph.'; ' All rights reserved.' ];
];

$rudi.serve;

Very dashing splash of red:

Editor’s Note: Pico CSS already defines a coherent set of styles for all the tags used so far … so no need to decorate our content code with e.g. Tailwind (unless you want to).

But his new header was pretty so-so…

At a Gallop

It was a short job to curry the Air::Base header routine with one of his own:

my &rude-header = &header.assuming(
    :style(q:to/END/),
        background: #b30000;
        color: white;
        padding: 2rem;
        text-align: center;
        END
);
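The .assuming trick is ordinary Raku currying, nothing Air-specific. Here is a minimal standalone sketch of the same mechanism on a plain sub (greet and red-greet are illustrative names, not part of Air):

```raku
# A hypothetical sub standing in for Air's header routine
sub greet($content, :$style = 'plain') {
    "[$style] $content"
}

# Fix the style argument once; callers only supply content
my &red-greet = &greet.assuming(style => 'red');

say red-greet('Ho ho');  # [red] Ho ho
```

Air’s header routine is curried the same way: the :style named argument is fixed once, and the page content is supplied at each call site.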

And voila:

Stable End

Rudolph looked on with quiet satisfaction at his work and fired up his holly-wood churchwarden pipe.

Rudolph, bright and cozy,
amid the tinsel light,
puffs upon a churchwarden pipe
that glows like Yuletide night.

Find out if things can get better next time…

~librasteve


Day 7 – Allowing for fewer dollars

Lizzybel had been taking a bit of vacation from all of the busy-ness in the corridors of North Pole Grand Central.

While doing a small visit to the corridors, she ran into Nanunanu, one of the IT elves.  Nanunanu was a bit worried, because they had not seen Lizzybel for a while. “Don’t worry”, said Lizzybel. “I’m just recharging my batteries a bit while doing some other stuff that I have neglected for a while. But I have been following developments from a distance, to stay at least a bit in the loop”, Lizzybel said with a bit of a grin. “Ah, ok”, said Nanunanu, “anything particular that caught your eye?”.

“Now that you mention it: it looks like quite a few potential users of the Raku Programming Language are put off by the use of sigils in variable declarations, specifically the $”, said Lizzybel while taking out her phone and showing a HackerNews comment to Nanunanu.

“What a silly reason to not want to look deeper into Raku”.  Nanunanu agreed and went on their way because busy, busy, busy!

While going home, Lizzybel was thinking: “On the other hand, it was clear that this was about first impressions. And first impressions are important. So because of this first impression, the Raku Programming Language was potentially missing out on a significant number of new users! What a pity!”

A constant

When back home, Lizzybel thought: “But the Raku Programming Language is no stranger to sigilless constants”

my constant answer = 42;

“is but an example”. “And with a little trick, you could even make sigilless variables”, she was mumbling to herself:

my \answer = my $;
answer = 42;
say answer;  # 42

But that is really yucky. And would not help with a first impression of the Raku Programming Language, at all!

Emojional

Then she was reminded of a playful module she’d made several years ago: Slang::Emoji. It allowed one to define and use variables whose name was a single character emoji, such as 👍 or 🏳️‍🌈:

use Slang::Emoji;
my 👍 = 42;
say 👍;  # 42

To make this possible, she remembered that she had actually sneaked a special token into the grammar of the Raku Programming Language: sigilless-variable. Maybe that token could be used to create sigilless variables in Raku as well?

Nogil

It turns out there had already been a Raku slang for sigilless variables, by Martin Tourneboeuf. But sadly it had bitrotted. “Why not use that namespace?”, Lizzybel thought to herself. “Indeed, why not?”. The initial iteration of transmogrifying the Slang::Emoji module into Slang::Nogil looked simple enough: just replace <.:So> with <.ident>+, add a check that we’re actually in a definition ($*IN_SPEC), and voila: Slang::Nogil 1.1.

use Slang::Nogil;
my answer = 42;
say answer;  # 42

And fortunately, Martin Tourneboeuf was happy with the result.

All was good, but then some issues started to become clear(er).

Nogil vs Emoji

Because both Slang::Emoji and Slang::Nogil mix in a new version of the “sigilless-variable” token, one module was trampling on the other.  Lizzybel realized that the best solution would be for a mixed-in token to simply re-dispatch to the original token. But alas, after about two days of hacking, it turned out to be impossible to do so in a transparent manner.

So the next best thing was to integrate the Slang::Emoji functionality into Slang::Nogil: an emoji could be considered to be a sigilless identifier after all, could it not?

The result was Slang::Nogil 1.2.

Not enough testing: more trouble

Lizzybel had only tested the simplest cases. But not something like my Int answer = 42, which fails with a “Two terms in a row” error. Or something even worse, that would affect a lot of code in the wild: my sub a() { }, which would also fail in the same way.

Clearly the “sigilless-variable” approach would either require a more general approach in the Raku grammar, or would involve some serious ad-hoc workaround hacking in the Slang::Nogil module.

Because it was nearly Xmas, Lizzybel opted for the ad-hoc workaround hacking approach for now. “At least people would be able to play around with the use of sigilless variables in Raku, which some people would consider a nice Xmas present” was Lizzybel‘s line of thought.

And after some hacking, Slang::Nogil 1.3 saw the light of day.

Not always available, or?

Nanunanu found out about the latest update of Slang::Nogil and enthusiastically sent a private message to Lizzybel on IRC: “Very nice, I always wanted to be able to avoid sigils for variables with limited scopes. And now I can! But I would still always need to load the Slang::Nogil module in my code, no?”.

Lizzybel answered: “Yes, at the moment you would have to. But fortunately, you can automate that as well with the RAKUDO_OPT environment variable. Just put RAKUDO_OPT=-MSlang::Nogil in your environment, and you don’t need to think about it anymore!”. It was silent on the other end. But that was just because Nanunanu was also busy with something else.

After a few minutes Nanunanu answered: “That’s pretty cool, didn’t know you could do that :-). But of course it would be nicer still if it was just part of Raku, wouldn’t it?”.

A good question

“Should this be part of Raku, perhaps in the next language level?”, wondered Lizzybel. “And should this only apply to variable definitions? Or also to signatures, so you would be able to do something like for <a b c d> -> letter { say letter }. Or would that affect error reporting on common errors too much? Or would we be able to change the grammar and error reporting in such a way that sigilless identifiers in signatures would not be a problem after all?”.

“Perhaps it is time for a language problem solving issue. And an associated Rakudo Pull Request“, thought Lizzybel. “But not now, as I’m still recharging my batteries”.

UPDATE: there is a problem solving issue now!

Day 6 – Robust code generation combining grammars and LLMs

Introduction

This document (notebook) discusses different combinations of Grammar-Based Parser-Interpreters (GBPI) and Large Language Models (LLMs) to generate executable code from Natural Language Computational Specifications (NLCS). We have the soft assumption that the NLCS adhere to a certain relatively small Domain Specific Language (DSL), or use terminology from that DSL. Another assumption is that the target software packages are not necessarily well known by the LLMs, i.e. direct LLM requests for code using them would produce meaningless results.

We want to do such combinations because:

  • GBPI are fast, precise, but with a narrow DSL scope
  • LLMs can be unreliable and slow, but with a wide DSL scope

Because GBPI and LLMs are complementary technologies with similar and overlapping goals, the possible combinations are many. We concentrate on two of the most straightforward designs: (1) a judged parallel race of method executions, and (2) using LLMs as a fallback method if grammar parsing fails. We show asynchronous programming implementations for both designs using the package LLM::Graph.

The Machine Learning (ML) package “ML::SparseMatrixRecommender” is used to demonstrate that the generated code is executable.

The rest of the document is structured as follows:

  • Initial grammar-LLM combinations
    • Assumptions, straightforward designs, and trade-offs
  • Comprehensive combinations enumeration (attempt)
    • Tabular and morphological analysis breakdown
  • Three methods for parsing ML DSL specs into Raku code
    • One grammar-based, two LLM-based
  • Parallel execution with an LLM judge
    • Straightforward, but computationally wasteful and expensive
  • Grammar-to-LLM fallback mechanism
    • The easiest and most robust solution
  • Concluding comments and observations

TL;DR

  • Combining grammars and LLMs produces robust translators.
  • Three translators with different faithfulness and coverage are demonstrated and used.
  • Two of the simplest, yet effective, combinations are implemented and demonstrated.
    • Parallel race and grammar-to-LLM fallback.
  • Asynchronous implementations with LLM-graphs are a very good fit!
    • Just look at the LLM-graph plots (and be done reading.)

Initial Combinations and Associated Assumptions

The goal is to combine the core features of Raku with LLMs in order to achieve robust parsing and interpretation of computational workflow specifications.

Here are some example combinations of these approaches:

  1. A few methods, both grammar-based and LLM-based, are initiated in parallel. Whichever method produces a correct result first is selected as the answer.
    • This approach assumes that when the grammar-based methods are effective, they will finish more quickly than the LLM-based methods.
  2. The grammar method is invoked first; if it fails, an LLM method (or a sequence of LLM methods) is employed.
  3. LLMs are utilized at the grammar-rule level to provide matching objects that the grammar can work with.
  4. If the grammar method fails, an LLM normalizer for user commands is invoked to generate specifications that the grammar can parse.
  5. It is important to distinguish between declarative specifications and those that prescribe specific steps.
    • For a workflow given as a list of steps the grammar parser may successfully parse most steps, but LLMs may be required for a few exceptions.

The main trade-off in these approaches is as follows:

  • Grammar methods are challenging to develop but can be very fast and precise.
    • Precision can be guaranteed and rigorously tested.
  • LLM methods are quicker to develop but tend to be slower and can be unreliable, particularly for less popular workflows, programming languages, and packages.

Also, combinations based on LLM tools (aka LLM external function calling) are not considered, because LLM-tool invocation is too unpredictable and unreliable.


Comprehensive breakdown (attempt)

This section has a “concise” table that expands the combinations list above into the main combinatorial strategies for Raku core features × LLMs for robust parsing and interpretation of workflow specifications. The table is not an exhaustive list of such combinations, but illustrates their diversity and, hopefully, can give ideas for future developments.

A few summary points (on the table’s content):

  • Grammar (Raku regex/grammar)
    • Pros: fast, deterministic, validated, reproducible
    • Cons: hard to design for large domains, brittle for natural language inputs
  • LLMs
    • Pros: fast to prototype, excellent at normalization/paraphrasing, flexible
    • Cons: slow, occasionally wrong, hallucination risk, inconsistent output formats
  • Conclusion:
    • The most robust systems combine grammar precision with LLM adaptability, typically by putting grammars first and using LLMs for repair, normalization, expansions, or semantic interpretation (i.e. “fallback”.)

Table: Combination Patterns for Parsing Workflow Specifications

  • Parallel Race: Grammar + LLM
    • Launch Raku grammar parsing and one or more LLM interpreters in parallel; whichever yields a valid parse first is accepted.
    • Pros: fast when grammar succeeds; robust fallback; reduces latency unpredictability of LLMs
    • Cons / Trade-offs: requires orchestration; needs a validator for LLM output
  • Grammar-First, LLM-Fallback
    • Try the grammar parser first; if it fails, invoke LLM-based parsing or normalization.
    • Pros: deterministic preference for grammar; testable correctness when grammar succeeds
    • Cons / Trade-offs: LLM fallback may produce inconsistent structures
  • LLM-Assisted Grammar (Rule-Level)
    • Individual grammar rules delegate to an LLM for ambiguous or context-heavy matching; the LLM supplies tokens or AST fragments.
    • Pros: handles complexity without inflating the grammar; modular LLM usage
    • Cons / Trade-offs: LLM behavior may break rule determinism; harder to reproduce
  • LLM Normalizer → Grammar Parser
    • When the grammar fails, the LLM rewrites/normalizes the input into a canonical form; the grammar is applied again.
    • Pros: grammar remains simple; leverages LLM skill at paraphrasing
    • Cons / Trade-offs: quality depends on the normalizer; feedback loops possible
  • Hybrid Declarative vs Procedural Parsing
    • The grammar extracts structural/declarative parts; the LLM interprets procedural/stepwise parts, or vice versa.
    • Pros: each subsystem tackles what it is best at; reduces grammar complexity
    • Cons / Trade-offs: harder to maintain global consistency; requires AST stitching logic
  • Grammar-Generated Tests for LLM
    • The grammar is used to generate examples and counterexamples; LLM outputs are validated against grammar constraints.
    • Pros: powerful for verifying LLM outputs; reduces hallucinations
    • Cons / Trade-offs: grammar must encode constraints richly; validation pipeline required
  • LLM as Adaptive Heuristic for Grammar Ambiguities
    • When the grammar yields multiple parses, the LLM chooses or ranks the “most plausible” AST.
    • Pros: improves disambiguation; good for underspecified workflows
    • Cons / Trade-offs: LLM can pick syntactically impossible interpretations
  • LLM as Semantic Phase After Grammar
    • The grammar creates an AST; the LLM interprets semantics, fills in missing steps, or resolves vague ops.
    • Pros: clean separation of syntax vs semantics; grammar ensures correctness
    • Cons / Trade-offs: semantic interpretation may drift from syntax
  • Self-Healing Parse Loop
    • Grammar fails → LLM proposes corrections → grammar retries → if still failing, the LLM creates the full AST.
    • Pros: iterative and robust; captures user intent progressively
    • Cons / Trade-offs: more expensive; risk of oscillation
  • Grammar Embedding Inside Prompt Templates
    • Raku grammar definitions are serialized into the prompt, guiding the LLM to conform to the grammar (soft constraints).
    • Pros: faster than grammar execution in some cases; encourages consistent structure
    • Cons / Trade-offs: weak guarantees; the LLM may ignore the grammar
  • LLM-Driven Grammar Induction or Refinement
    • The LLM suggests new grammar rules or transformations; a developer approves; the Raku grammar evolves over time.
    • Pros: faster grammar evolution; useful for new workflow languages
    • Cons / Trade-offs: requires human QA; risk of regressing accuracy
  • Raku Regex Engine as LLM Guardrail
    • Raku regex or token rules are used to validate or filter LLM results before accepting them.
    • Pros: lightweight constraints; useful for quick prototyping
    • Cons / Trade-offs: regex too weak for complex syntax

Diversity reasons

  • The diversity of combinations in the table above arises because Raku grammars and LLMs occupy fundamentally different but highly complementary positions in the parsing spectrum.
  • Raku grammars provide determinism, speed, verifiability, and structural guarantees, but they require upfront design and struggle with ambiguity, informal input, and evolving specifications.
  • LLMs, in contrast, excel at normalization, semantic interpretation, ambiguity resolution, and adapting to fluid or poorly defined languages, yet they lack determinism, may hallucinate, and are slower.
  • When these two technologies meet, every architectural choice — who handles syntax, who handles semantics, who runs first, who validates whom, who repairs or refines — defines a distinct strategy.
  • Hence, the design space naturally expands into many valid hybrid patterns rather than a single “best” pipeline.
  • That said, the fallback pattern implemented below can be considered the “best option” from certain development perspectives because it is simple, effective, and has fast execution times.

See the corresponding Morphological Analysis table, which corresponds to this taxonomy mind-map:


Setup

Here are the packages used in this document (notebook):

use DSL::Translators;
use DSL::Examples;
use ML::NLPTemplateEngine;

use LLM::Graph;

Here are LLM-models access configurations:

my $conf41-mini = llm-configuration('ChatGPT', model => 'gpt-4.1-mini', temperature => 0.45);
my $conf41 = llm-configuration('ChatGPT', model => 'gpt-4.1', temperature => 0.45);
my $conf51 = llm-configuration('ChatGPT', model => 'gpt-5.1', reasoning-effort => 'none');
my $conf-gemini20-flash = llm-configuration('Gemini', model => 'gemini-2.0-flash');

Three DSL translations

This section demonstrates the use of three different translation methods:

  1. Grammar-based parser-interpreter of computational workflows
  2. LLM-based translator using few-shot learning with relevant DSL examples
  3. Natural Language Processing (NLP) interpreter using code templates and LLMs to fill-in the corresponding parameters

The translators are ordered according to their faithfulness, most faithful first.
At the same time, they are ordered according to their coverage; the widest coverage comes last.

Grammar-based

Here a recommender pipeline specified with natural language commands is translated into Raku code of the package “ML::SparseMatrixRecommender” using a sub of the package “DSL::Translators”:

'
create from @dsData; 
apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsData on "id";
echo the pipeline value
'
==> ToDSLCode(to => 'Raku', format => 'CODE')
==> {.subst('.', "\n."):g}()

# my $obj = ML::SparseMatrixRecommender
# .new
# .create-from-wide-form(@dsData)
# .apply-term-weight-functions(global-weight-func => "IDF", local-weight-func => "None", normalizer-func => "Cosine")
# .recommend-by-profile(["passengerSex:male", "passengerClass:1st"])
# .join-across(@dsData, on => "id" )
# .echo-value()
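The final {.subst('.', "\n."):g} stage of the pipeline above is plain string rewriting: it puts every method call of the generated fluent chain on its own line. Here is a standalone sketch of that step (the $code value is illustrative):

```raku
# Break a fluent method chain onto one line per call
my $code = 'ML::SparseMatrixRecommender.new.create-from-wide-form(@dsData).echo-value()';
say $code.subst('.', "\n."):g;
# ML::SparseMatrixRecommender
# .new
# .create-from-wide-form(@dsData)
# .echo-value()
```

Note that, being purely textual, this transform would also split a literal dot inside a string argument.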

For more details of the grammar-based approach see the presentations:

Via LLM examples

LLM translations can be done using a set of from-to rules; this is the so-called few-shot learning of LLMs. The package “DSL::Examples” has a collection of such examples for different computational workflows (mostly ML at this point). The examples are hierarchically organized by programming language and workflow name; see the resource file “dsl-examples.json”, or execute dsl-examples.

Here is a table that shows the known DSL translation examples in “DSL::Examples”:

#% html
dsl-examples().map({ $_.key X ($_.value.keys Z $_.value.values».elems) }).flat(1).map({ <language workflow examples-count> Z=> $_.flat })».Hash.sort(*<language workflow>).Array
==> to-dataset()
==> to-html(field-names => <language workflow examples-count>)

language  workflow  examples-count
Python    LSAMon    15
Python    QRMon     23
Python    SMRMon    20
R         LSAMon    17
R         QRMon     26
R         SMRMon    20
Raku      SMRMon    20
WL        ClCon     20
WL        LSAMon    17
WL        QRMon     27
WL        SMRMon    20

Here is the definition of an LLM translation function that uses examples:

my &llm-pipeline-segment = llm-example-function(dsl-examples()<Raku><SMRMon>);

Here is a recommender pipeline specified with natural language commands:

my $spec = q:to/END/;
new recommender;
create from @dsData; 
apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsData on "id";
echo the pipeline value;
classify by profile passengerSex:female, and passengerClass:1st on the tag passengerSurvival;
echo value
END

sink my @commands = $spec.lines;

Translate to Raku code line-by-line:

@commands
.map({ .&llm-pipeline-segment })
.map({ .subst(/:i Output \h* ':'?/, :g).trim })
.join("\n.")

# ML::SparseMatrixRecommender.new;
# .create(@dsData)
# .apply-term-weight-functions('IDF', 'None', 'Cosine')
# .recommend-by-profile({'passengerSex.male' => 1, 'passengerClass.1st' => 1})
# .join-across(@dsData, on => 'id')
# .echo-value()
# .classify-by-profile('passengerSurvival', {'passengerSex.female' => True, 'passengerClass.1st' => True})
# .echo-value()

Or translate by just calling the function over the whole $spec:

&llm-pipeline-segment($spec)

# ML::SparseMatrixRecommender.new;  
# create(@dsData);  
# apply-term-weight-functions('IDF', 'None', 'Cosine');  
# recommend-by-profile({'passengerSex.male' => 1, 'passengerClass.1st' => 1});  
# join-across(@dsData, on => 'id');  
# echo-value();  
# classify-by-profile('passengerSurvival', [{'passengerSex.female' => 1, 'passengerClass.1st' => 1}]);  
# echo-value()

Remark: The latter call is faster, but it needs additional processing for “monadic” workflows.

By NLP Template Engine

Here a “free text” recommender pipeline specification is translated to Raku code using the sub concretize of the package “ML::NLPTemplateEngine”:

'create a recommender with dfTitanic; apply the LSI functions IDF, None, Cosine; recommend by profile 1st and male'
==> concretize(lang => "Raku", e => $conf41-mini)

# my $smrObj = ML::SparseMatrixRecommender.new
# .create-from-wide-form(dfTitanic, item-column-name='id', :add-tag-types-to-column-names, tag-value-separator=':')
# .apply-term-weight-functions('IDF', 'None', 'Cosine')
# .recommend-by-profile(["1st"], 12, :!normalize)
# .join-across(dfTitanic)
# .echo-value();

The package “ML::NLPTemplateEngine” uses a Question Answering System (QAS) implemented in “ML::FindTextualAnswer”. A QAS can be implemented in different ways, with different conceptual and computational complexity. Currently, “ML::FindTextualAnswer” has only an LLM-based implementation of a QAS.

For more details of the NLP template engine approach see the presentations:


Parallel race (judged): Grammar + LLM

In this section we implement the first, most obvious, and conceptually simplest combination of grammar-based and LLM-based translations:

  • All translators — grammar-based and LLM-based — are run in parallel
  • An LLM judge selects the one that adheres best to the given specification

The implementation of this strategy with an LLM graph (say, by using “LLM::Graph”) is straightforward.

Here is such an LLM graph that:

  • Runs all three translation methods above
  • An LLM judge picks which one of the methods produced the best result
  • The judge’s output is used to make a Markdown report

my %rules =
    dsl-grammar => { 
        eval-function => sub ($spec, $lang = 'Raku') { ToDSLCode($spec, to => $lang, format => 'CODE') }
    },

    llm-examples => { 
        llm-function => 
            sub ($spec, $lang = 'Raku', $split = False) { 
                my &llm-pipeline-segment = llm-example-function(dsl-examples(){$lang}<SMRMon>);
                return do if $split {
                    note 'with spec splitting...';
                    my @commands = $spec.lines;
                    @commands.map({ .&llm-pipeline-segment }).map({ .subst(/:i Output \h* ':'?/, :g).trim }).join("\n.")
                } else {
                    note 'no spec splitting...';
                    &llm-pipeline-segment($spec).subst(";\n", "\n."):g
                }
            },
    },

    nlp-template-engine => {
        llm-function => sub ($spec, $lang = 'Raku') { concretize($spec, :$lang) }
    },

    judge => sub ($spec, $lang, $dsl-grammar, $llm-examples, $nlp-template-engine) {
            [
                "Choose the generated code that most fully adheres to the spec:\n",
                $spec,
                "\nfrom the following $lang generation results:\n\n",
                "1) DSL-grammar:\n$dsl-grammar\n",
                "2) LLM-examples:\n$llm-examples\n",
                "3) NLP-template-engine:\n$nlp-template-engine\n",
                "and copy it:"
            ].join("\n\n")
    },
    
    report => {
            eval-function => sub ($spec, $lang, $dsl-grammar, $llm-examples, $nlp-template-engine, $judge) {
                [
                    '# Best generated code',
                    "Three $lang code generations were submitted for the spec:",
                    '```text',
                    $spec,
                    '```',
                    'Here are the results:',
                    to-html( ['dsl-grammar', 'llm-examples', 'nlp-template-engine'].map({ [ name => $_, code => ::('$' ~ $_)] })».Hash.Array, field-names => <name code> ).subst("\n", '<br/>'):g,
                    '## Judgement',
                    $judge.contains('```') ?? $judge !! "```$lang\n" ~ $judge ~ "\n```"
                ].join("\n\n")
            }
    }        
;

my $gBestCode = LLM::Graph.new(%rules)

# LLM::Graph(size => 5, nodes => dsl-grammar, judge, llm-examples, nlp-template-engine, report)

Here is a recommender workflow specification:

my $spec = q:to/END/;
make a brand new recommender with the data @dsData;
apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsData on "id";
echo the pipeline value;
END

$gBestCode.eval(:$spec, lang => 'Raku', :split)

#    with spec splitting...
#   LLM::Graph(size => 5, nodes => dsl-grammar, judge, llm-examples, nlp-template-engine, report)

Here the LLM-graph result — which is a Markdown report — is rendered:

#% markdown
$gBestCode.nodes<report><result>

Best generated code

Three Raku code generations were submitted for the spec:

make a brand new recommender with the data @dsData;
apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsData on "id";
echo the pipeline value;



Here are the results:

  • dsl-grammar: (empty)
  • llm-examples:
    ML::SparseMatrixRecommender.new(@dsData)
    .apply-term-weight-functions('IDF', 'None', 'Cosine')
    .recommend-by-profile({'passengerSex.male' => 1, 'passengerClass.1st' => 1})
    .join-across(@dsData, on => 'id')
    .echo-value()
  • nlp-template-engine:
    my $smrObj = ML::SparseMatrixRecommender.new
    .create-from-wide-form(["passengerSex:male", "passengerClass:1st"]set, item-column-name='id', :add-tag-types-to-column-names, tag-value-separator=':')
    .apply-term-weight-functions('IDF', 'None', 'Cosine')
    .recommend-by-profile(["passengerSex:male", "passengerClass:1st"], 12, :!normalize)
    .join-across(["passengerSex:male", "passengerClass:1st"]set)
    .echo-value();

Judgement

ML::SparseMatrixRecommender.new(@dsData)
.apply-term-weight-functions('IDF', 'None', 'Cosine')
.recommend-by-profile({'passengerSex.male' => 1, 'passengerClass.1st' => 1})
.join-across(@dsData, on => 'id')
.echo-value()

LLM-graph visualization

Here is a visualization of the LLM graph defined and run above:

#% html
$gBestCode.dot(engine => 'dot', :9graph-size, node-width => 1.7, node-color => 'grey', edge-color => 'grey', edge-width => 0.4, theme => 'default'):svg

For details on LLM-graphs making and their visualization representations see blog posts:


Fallback: DSL-grammar to LLM-examples

Instead of having the DSL-grammar and LLM computations run in parallel, we can make an LLM-graph in which the LLM computations are invoked only if the DSL-grammar parsing-and-interpretation fails. In this section we make such a graph.

Before making the graph, let us also generalize it to work with other ML workflows, not just recommendations. The function ToDSLCode (of the package “DSL::Translators”) has an ML workflow classifier based on prefix trees; see [AA1].

Let us make an LLM function with similar functionality, i.e. an LLM function that classifies a natural language computation specification into the workflow labels used by “DSL::Examples”. Here is such a function, using the sub llm-classify provided by “ML::FindTextualAnswer”:

# Natural language labels to be understood by LLMs
my @mlLabels = 'Classification', 'Latent Semantic Analysis', 'Quantile Regression', 'Recommendations';

# Map natural language labels to workflow names in "DSL::Examples"
my %toMonNames = @mlLabels Z=> <ClCon LSAMon QRMon SMRMon>; 

# Map the &llm-classify result to workflow names
my &llm-ml-workflow = -> $spec { my $res = llm-classify($spec, @mlLabels, request => 'which of these workflows characterizes it'); %toMonNames{$res} // $res };

# Example invocation
&llm-ml-workflow($spec)

# SMRMon

In addition, we have to specify a pipeline “separator” for the different programming languages:

my %langSeparator = Python => "\n.", Raku => "\n.", R => "%>%\n", WL => "⟹\n";
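For instance, joining per-command translation results with the language separator yields one fluent pipeline. A standalone sketch (the segments are illustrative):

```raku
# Per-language pipeline separators, as defined above
my %langSeparator = Python => "\n.", Raku => "\n.", R => "%>%\n", WL => "⟹\n";

# Hypothetical per-command translation results
my @segments = 'ML::SparseMatrixRecommender.new',
               'create-from-wide-form(@dsData)',
               'echo-value()';

say @segments.join(%langSeparator<Raku>);
# ML::SparseMatrixRecommender.new
# .create-from-wide-form(@dsData)
# .echo-value()
```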

Here is the LLM-graph:

my %rules =
    dsl-grammar => { 
        eval-function => sub ($spec, $lang = 'Raku') { 
            my $res = ToDSLCode($spec, to => $lang, format => 'CODE'); 
            my $checkStr = 'my $obj = ML::SparseMatrixRecommender.new';
            return do with $res.match(/ $checkStr /):g { 
                $/.list.elems > 1 ?? $res.subst($checkStr) !! $res 
            }
        }
    },

    workflow-name => {
        llm-function => sub ($spec) { &llm-ml-workflow($spec) }
    },

    llm-examples => { 
        llm-function => 
            sub ($spec, $workflow-name, $lang = 'Raku', $split = False) {
                my &llm-pipeline-segment = llm-example-function(dsl-examples(){$lang}{$workflow-name});
                return do if $split {
                    my @commands = $spec.lines;
                    @commands.map({ .&llm-pipeline-segment }).map({ .subst(/:i Output \h* ':'?/, :g).trim }).join(%langSeparator{$lang})
                } else {
                    &llm-pipeline-segment($spec).subst(";\n", %langSeparator{$lang}):g
                }
            },
        test-function => sub ($dsl-grammar) { !($dsl-grammar ~~ Str:D && $dsl-grammar.trim.chars) }
    },
    
    code => {
            eval-function => sub ($dsl-grammar, $llm-examples) {
                $dsl-grammar ~~ Str:D && $dsl-grammar.trim ?? $dsl-grammar !! $llm-examples
            }
    }   
;

my $gRobust = LLM::Graph.new(%rules):!async

# LLM::Graph(size => 4, nodes => code, dsl-grammar, llm-examples, workflow-name)
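The fallback logic hinges on the test-function of the llm-examples node: the LLM branch runs only when the dsl-grammar node did not produce a non-empty string. Here is a minimal, standalone illustration of that predicate (the name &needs-llm is hypothetical):

```raku
# Standalone version of the llm-examples test-function predicate
my &needs-llm = -> $dsl-grammar {
    # True (run the LLM branch) unless we got a non-empty string of code
    !($dsl-grammar ~~ Str:D && $dsl-grammar.trim.chars)
};

say &needs-llm('my $obj = ML::SparseMatrixRecommender.new');  # False: grammar code exists
say &needs-llm('   ');                                        # True: blank result
say &needs-llm(Any);                                          # True: parsing failed
```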

Here the LLM graph is run over a spec that can be parsed by the DSL grammar (notice the very short computation time):

my $spec = q:to/END/;
create from @dsData; 
apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsData on "id";
echo the pipeline value;
END

$gRobust.eval(:$spec, lang => 'Raku', :!split)

# LLM::Graph(size => 4, nodes => code, dsl-grammar, llm-examples, workflow-name)

Here is the obtained result:

$gRobust.nodes<code><result>

# my $obj = ML::SparseMatrixRecommender.new.create-from-wide-form(@dsData).apply-term-weight-functions(global-weight-func => "IDF", local-weight-func => "None", normalizer-func => "Cosine").recommend-by-profile(["passengerSex:male", "passengerClass:1st"]).join-across(@dsData, on => "id" ).echo-value()

Here is a spec that cannot be parsed by the DSL-grammar interpreter; note that only the wording of the first two lines has changed:

my $spec = q:to/END/;
new recommender with @dsData, please; 
also apply LSI functions IDF, None, Cosine; 
recommend by profile for passengerSex:male, and passengerClass:1st;
join across with @dsData on "id";
echo the pipeline value;
END

$gRobust.eval(:$spec, lang => 'Raku', :!split)

#    Cannot parse the command; error in rule recommender-object-phrase:sym<English> at line 1; target 'new recommender with @dsData, please' position 16; parsed 'new recommender', un-parsed 'with @dsData, please' .
#
#    LLM::Graph(size => 4, nodes => code, dsl-grammar, llm-examples, workflow-name)

Nevertheless, we obtain a correct result via LLM-examples:

$gRobust.nodes<code><result>

# ML::SparseMatrixRecommender.new(@dsData)
# .apply-term-weight-functions('IDF', 'None', 'Cosine')
# .recommend-by-profile({'passengerSex.male' => 1, 'passengerClass.1st' => 1})
# .join-across(@dsData, on => 'id')
# .echo-value();

Here is the corresponding graph plot:

#% html
$gRobust.dot(engine => 'dot', :9graph-size, node-width => 1.7, node-color => 'grey', edge-color => 'grey', edge-width => 0.4, theme => 'default'):svg

Let us specify another workflow — for ML-classification with Wolfram Language — and run the graph:

my $spec = q:to/END/;
use the dataset @dsData;
split the data into training and testing parts with 0.8 ratio;
make a nearest neighbors classifier;
show classifier accuracy, precision, and recall;
echo the pipeline value;
END

$gRobust.eval(:$spec, lang => 'WL', :split)

# LLM::Graph(size => 4, nodes => code, dsl-grammar, llm-examples, workflow-name)

$gRobust.nodes<code><result>

#  ClConUnit[dsData]⟹
#    ClConSplitData[0.8]⟹
#    ClConMakeClassifier["NearestNeighbors"]⟹
#    Function[{v,c},ClConUnit[v,c]⟹ClConClassifierMeasurements[{"Accuracy","Precision","Recall"}]⟹ClConEchoValue]⟹
#    ClConEchoValue


Concluding comments and observations

  • Using LLM graphs makes it possible to impose the desired orchestration of, and collaboration between, deterministic programs and LLMs.
    • By contrast, the “inversion of control” of LLM-tools is “capricious.”
  • LLM-graphs are both a generalization of LLM-tools and a lower-level infrastructural functionality than LLM-tools.
  • The LLM-graph for the parallel-race translation is very similar to the LLM-graph for comprehensive document summarization described in [AA4].
  • The expectation that DSL examples would provide both fast and faithful results is mostly confirmed in ≈20 experiments.
  • Using the NLP template engine is also fast, because the LLMs are harnessed through a Question Answering System (QAS).
  • The DSL-examples translation had to be complemented with a workflow classifier.
    • Such classifiers are also part of the implementations of the other two approaches.
    • The grammar-based one uses a deterministic classifier, [AA1].
    • The NLP template engine uses an LLM classifier.
  • An interesting extension of the current work is to have a grammar-LLM combination in which when the grammar fails then the LLM “normalizes” the specs until the grammar can parse them.
    • Currently, LLM::Graph does not support graphs with cycles, hence this approach “can wait” (or be implemented by other means).
  • Multiple DSL examples can be efficiently derived by random sentence generation with the different grammars.
    • This is similar to the approach taken for making the DSL-commands classifier in [AA1].
  • LLMs can also be used to improve and extend the DSL grammars.
    • And it is interesting to consider automating that process, instead of doing it via human supervision.
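As a rough sketch of that normalize-and-retry extension (the sub name, the llm-synthesize prompt, and the use of try over ToDSLCode are all assumptions for illustration, not part of the current packages):

```raku
use DSL::Translators;
use LLM::Functions;

# Sketch: let an LLM rephrase a spec until the DSL grammar accepts it
sub normalize-until-parsed(Str $spec, UInt :$max-tries = 3) {
    my $current = $spec;
    for ^$max-tries {
        my $code = try ToDSLCode($current, to => 'Raku', format => 'CODE');
        return $code with $code;
        # Ask the LLM to restate the spec in grammar-friendly commands
        $current = llm-synthesize([
            'Rephrase this ML workflow specification into short imperative commands, one per line:',
            $current
        ]);
    }
    return Nil;
}
```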

References

Articles, blog posts

[AA1] Anton Antonov, “Fast and compact classifier of DSL commands”, (2022), RakuForPrediction at WordPress.

[AA2] Anton Antonov, “Grammar based random sentences generation, Part 1”, (2023), RakuForPrediction at WordPress.

[AA3] Anton Antonov, “LLM::Graph”, (2025), RakuForPrediction at WordPress.

[AA4] Anton Antonov, “Agentic-AI for text summarization”, (2025), RakuForPrediction at WordPress.

[AA5] Anton Antonov, “LLM::Graph plots interpretation guide”, (2025), RakuForPrediction at WordPress.

Packages

[AAp1] Anton Antonov, DSL::Translators, Raku package, (2020-2025), GitHub/antononcube.

[AAp2] Anton Antonov, ML::FindTextualAnswer, Raku package, (2023-2025), GitHub/antononcube.

[AAp3] Anton Antonov, ML::NLPTemplateEngine, Raku package, (2023-2025), GitHub/antononcube.

[AAp4] Anton Antonov, DSL::Examples, Raku package, (2024-2025), GitHub/antononcube.

[AAp5] Anton Antonov, LLM::Graph, Raku package, (2025), GitHub/antononcube.

[AAp6] Anton Antonov, ML::SparseMatrixRecommender, Raku package, (2025), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “NLP Template Engine, Part 1”, (2021), YouTube/@AAA4prediction.

[AAv2] Anton Antonov, “Natural Language Processing Template Engine”, (2023), YouTube/@WolframResearch.

[WRIv1] Wolfram Research, Inc., “Live CEOing Ep 886: Design Review of LLMGraph”, (2025), YouTube/@WolframResearch.