Day 15 – An expression language for Vixen

#raku-beginners: korvo: Hi! I’m trying out Raku in stead of META II for a toy compiler.

(5 minutes later) korvo: I’m merely trying to compile a little expression language because my angle of repose has slightly increased, and multiple folks have recommended Raku for parsing and lightweight compilers.

(20 minutes later) korvo: [T]hanks for a worthy successor to META II. This is the first enjoyable parser toolkit I’ve used in a while; I spent almost no time fussing over the Regex tools, found it easy to refactor productions, and am spending most of my time trying to handle strings and lists and I/O.

Happy Holiday! As we enter the winter-solstice season, it’s worth reflecting on the way that hunkering down for the cold can bring about new avenues of exploration. I have a jar of pickled homegrown banana peppers in the corner of my refrigerator slowly evolving into something delicious; I do have to shake it every few days, but it will be worth it to add that fruity sour spice to any dish. Similarly, this can be a season for letting old concepts ferment into new concepts.

I have written so many parsers and parser toolkits. I’m tired of writing parsers yet again. So, after hearing about Raku’s builtin support for grammars, I decided that I would commit to trying Raku for my next parsing task. How hard could it be? I’ve already used and forgotten Perl 5 and Ruby.

I don’t need to reinvent Raku. ~ me, very tired

Problem

I’m working on a language that I call Vixen. I should back up.

At the beginning of the year, Joel Jakubovic blogged that The Unix Binary wants to be a Smalltalk Method, Not an Object. They argue that, while we have traditionally thought that Unix files correspond to objects, we should instead correspond Unix directories with objects and Unix files with methods. By “object” I mean a bundle of state and behavior which communicates with other objects by passing messages to them. This is a big deal for folks who study what “object” means, but not really interesting for the wider programming world. However, they followed it up with a prototype and a paper, The Unix Executable as a Smalltalk Method: And its implications for Unix-Smalltalk unification. Jakubovic provides a calling convention, which we call Smalltix, that allows us to treat a Unix system as if it were a Smalltalk-like message-passing object-based system. Crucially, there isn’t a single language for programming Smalltix, because of fragmentation: a Unix system already has many different languages for writing executable programs, and adding another language would only fragment the system further.

Okay! So, I’m working on Vixen, a fork of Smalltix. Jakubovic used Bash and Smalltalk-style classes; I’m simplifying by using execline and Self-style prototypes. Eventually, I’ve got a few dozen little scripts written with execline. Can I simplify further?

Now, I fully admit that execline is something of an alien language, and I should explain at least some of it before continuing. Execline is based on the idea of Bernstein chain loading; the interpreter takes all arguments in argv as a program and calls into multiple helpers which incrementally rewrite argv into a final command. Here’s an example method that I call “debug:” which takes a single argument and prints it to stderr. First it uses the fdmove helper to copy file descriptor 2 to file descriptor 1, shadowing stdout with stderr; finally, it echoes a string that interpolates the first and second items of argv. The calling convention in Smalltix and Vixen is that argv’s zeroth item is the method, the first item is the receiving object passed as a path to a directory, and the rest of the items are positional arguments. By tradition, there is one colon “:” in the name of a method per argument, so “debug:” takes one argument; also by tradition, the names of methods are called verbs. Since this method takes one positional argument, we pass the -s2 flag to the execline interpreter execlineb to collect argv up to index 2.

#!/usr/bin/env -S execlineb -s2
fdmove -c 1 2
echo "debug: ${1}: ${2}"

For something more complicated, here is a method named “allocateNamed:” which augments some other “allocate” method with the ability to control the name of a directory. This lets us attach names to otherwise-anonymous objects. Here, we import the name “V” from the environment envp to turn it into a usable variable. In Vixen, I’ve reserved the name “V” to refer to a utility object that can perform calls and other essential tasks. The backtick helper wraps a subprogram in a curly-brace-delimited block and captures its output. The foreground helper runs two subprograms in sequence; there’s also an if helper which exits early if the first subprogram fails.

#!/usr/bin/env -S execlineb -s2
importas -iS V
backtick -E path { ${V}/call: $1 allocate }
foreground { mkdir ${path}/${2} }
echo ${path}/${2}

Now, as the reader may know, object-based languages are all about messages, object references, and passing messages to object references. In some methods, like this one called “hasParent”, we are solely passing messages to objects; the method is merely a structure which composes some other objects. This is starting to be a lot of code; surely there’s a better way to express this composite?

#!/usr/bin/env -S execlineb -s1
importas -iS V
backtick -E parent { ${V}/call: $1 at: "parent*" }
${V}/call: $parent exists

Syntax

Let’s fragment the system a little bit by introducing an expression language just for this sort of composition. Our justification is that we aren’t going to actually replace execline; we’re just going to make it easier to write. We’ll scavenge some grammar from a few different flavors of Smalltalk. The idea is that our program above could be represented by something like:

[|^(self at: "parent*") exists]

For non-Smalltalkers, this is a block, a fundamental unit of code. The square brackets delimit the entire block. The portion to the right of the pipe “|” is a list of expressions; here there is only one. When the final expression starts with a caret “^”, it will become the answer or return value; there’s a designated Nil object that is answered by default. Expressions are merely messages passed to objects, with the object on the left and the message on the right. If a message verb ends with a colon “:” then it is called a keyword and labels an argument; for each verb with a colon there is a corresponding argument. The builtin name self refers to the current object.

The parentheses might seem odd at first! In Smalltalk, applications of verbs without arguments, so-called unary applications, bind tighter than keyword applications. If we did not parenthesize the example then we would end up with the inner call "parent*" exists, which is a unary application onto a string literal. We also must parenthesize to distinguish nested keyword applications, as in the following example:

[:source|
obj := (self at: "NixStore") intern: source.
^self at: obj name put: obj]

Here we can see the assignment token “:=” for creating local names. The full stop “.” occurs between expressions; it creates statements, which can either assign to a name or not. We can also see a parameter to this block, “:source”, which occurs on the left side of the pipe “|” and indicates that one argument can be passed along with any message.

Grammar

Okay, that’s enough of an introduction to Vixen’s expression language. How do we parse it? That’s where Raku comes in! (As Arlo Guthrie might point out, this is a blog post about Raku.) Our grammar features everything I’ve shown so far, as well as a few extra features like method cascading with the semicolon “;” for which I don’t have good example usage.

grammar Vixen {
    token id       { <[A..Za..z*]>+ <![:]> }
    token selector { <[A..Za..z*]>+ \: }
    token string   { \" <-["]>* \" }
    token param    { \: <id> }
    token ass      { \:\= }

    rule params { <param>* % <ws> }

    proto rule lit             {*}
          rule lit:sym<block>  { '[' <params> '|' <exprs> ']' }
          rule lit:sym<paren>  { '(' <call> ')' }
          rule lit:sym<string> { <string> }
          rule lit:sym<name>   { <id> }

    rule chain { <id>* % <ws> }

    proto rule unary {*}
          rule unary:sym<chain> { <lit> <chain> }

    rule keyword { <selector> <unary> }

    proto rule message {*}
          rule message:sym<key>  { <keyword>+ }

    rule messages { <message>* % ';' }

    proto rule call {*}
          rule call:sym<cascade> { <unary> <messages> }

    proto rule assign {*}
          rule assign:sym<name> { <id> <ass> <call> }
          rule assign:sym<call> { <call> }

    rule statement { <assign> '.' }

    proto rule exprs {*}
          rule exprs:sym<rv>  { <statement>* '^' <call> }
          rule exprs:sym<nil> { <statement>* <call> }

    rule TOP { '[' <params> '|' <exprs> ']' }
}

Writing the grammar is mostly a matter of repeatedly giving it example strings. The one tool that I find indespensible is some sort of debugging tracer which indicates where a parse rule has failed. I used Grammar::Tracer, available via zef. I’m on NixOS, so language-specific package managers don’t always play nice, but zef works and is recommended. First I ran:

$ zef install Grammar::Tracer

And then I could start my file with a single import in order to get tracing:

import Grammar::Tracer;

Actions

The semantic actions transform the concrete syntax tree to abstract syntax. This sort of step is not present in classic META II but is essential for maintaining sanity. I’m going to use this grammar for more than a few weeks, so I wrote a few classes for representing abstract syntax and a class of actions. Some actions are purely about extraction; for example, the method for the params production merely extracts a list of matches and extracts the Str for each match.

    method params($/) { make $.values.map: *.Str; }

Some actions contain optimizations that avoid building abstract syntax. The following method for unary handles chained messages, where we have multiple unary applications in a row; we want a special case for zero applications so that the VUnary class can assume that it always has at least one application.

    method unary:sym<chain>($/) {
        my $receiver = $.made;
        my @verb = $.made;
        make @verb ?? VUnary.new(:$receiver, :@verb) !! $receiver;
    }

Some actions build fresh abstract syntax not in the original program. The following method for exprs handles the case when there is no return caret; the final expression is upgraded to a statement which ignores its return value and the name Nil is constructed as the actual return value.

    method exprs:sym<nil>($/) {
        my @statements = $.values.map: *.made;
        my $call = $.made;
        @statements.push: VIgnore.new(:$call);
        my $rv = VName.new(:n("Nil"));
        make VExprs.new(:@statements, :$rv);
    }

Getting the actions right was difficult. I ended up asking for hints on IRC about how to work with matches. The .values method is very useful.

Abstract syntax

I had a couple false starts with the abstract syntax. I think that the right mentality is to have one node per production, but to have one role per compiler action. If necessary, change the grammar to make the abstract syntax easier to generate; Raku is flexible enough to allow grammars to be refactored. Rules like params, chain, and keyword were broken out to make life easier.

By the way, starting at this point, I am only showing excerpts from the complete compiler. The compiler is available in a separate gist. Classes may be incomplete; only relevant methods and attributes are shown.

For example, there is a role for emitting literals. A parenthesized call just unwraps the parentheses; a string is represented by itself.

role EmitLiteral {
    method prepareLiteral($compiler) { ... }
}
class VParen does EmitLiteral {
    has Call $.call;
    method prepareLiteral($compiler) { $.call.prepareLiteral: $compiler; }
}
class VStr does EmitLiteral {
    has Str $.s;
    method prepareLiteral($compiler) { $.s; }
}

We can also have a role for performing a call. We need two flavors of call: call and bind to a name, and also call without binding. It’s much easier to compile chains and cascades with the option to bind or not bind. We can put both roles onto a single class, so that a cascading application both performs a call and also evaluates to a literal expression.

role Call {
    method prepareBind($name, $compiler) { ... }
    method prepareOnly($compiler) { ... }
}
class VCall does Call does EmitLiteral {
    has EmitLiteral $.unary;
    has VMessage @.cascades;
    # NB: @cascades is inhabited!
    method prepareBind($name, $compiler) {
        my $receiver = $.unary.prepareLiteral: $compiler;
        my $last = @.cascades[*-1];
        for @.cascades[0 ...^ @.cascades.elems - 1] {
            my ($verb, @row) = $_.prepareMessage: $compiler;
            $compiler.callOnly: $receiver, $verb, @row;
        };
        my ($verb, @row) = $last.prepareMessage: $compiler;
        $compiler.callBind: $name, $receiver, $verb, @row;
    }
    method prepareOnly($compiler) {
        my $receiver = $.unary.prepareLiteral: $compiler;
        for @.cascades {
            my ($verb, @row) = $_.prepareMessage: $compiler;
            $compiler.callOnly: $receiver, $verb, @row;
        };
    }
    method prepareLiteral($compiler) {
        my $name = $compiler.gensym;
        self.prepareBind: $name, $compiler;
        "\$" ~ $name;
    }
}

A first compiler

We’ll start by compiling just one block. Our compiler will include a gensym: a method which can generate a symbol that hasn’t been used before. I’m not trying very hard here and it would be easy for a malicious user to access generated symbols; we can fix that later. The compiler is mostly going to store calls; each call can either be a backtick or an if (or foreground) depending on whether it binds a name.

class Compiler {
    has Int $!syms;
    method gensym { $!syms += 1; "gs" ~ $!syms; }

    has Str %.lines;
    method push($line) { %.lines ~= $line ~ "\n"; }

    method callBind($name, $receiver, $verb, @args) {
        self.push: "backtick -E $name \{ " ~ formatCall($receiver, $verb, @args) ~ " \}";
    }
    method callOnly($receiver, $verb, @args) {
        self.push: "if \{ " ~ formatCall($receiver, $verb, @args) ~ " \}";
    }

    method assignName($from, $to) { self.push: "define $to \$$from"; }
}

The method .assignName is needed to handle assignments without intermediate calls, as in this := that.

class VName does Call does EmitLiteral {
    has Str $.n;
    method prepareBind($name, $compiler) { $compiler.assignName: $.n, $name; }
    method prepareOnly($compiler) {;}
    method prepareLiteral($compiler) { "\$" ~ $.n; }
}

Calling into Vixen

To compile multiple blocks, we will need to emit multiple blocks. A reasonable approach might be to emit a JSON Object where each key is a block name and each value is a String containing the compiled block. I’m feeling more adventurous than that, though. Here’s a complete Smalltix/Vixen FFI:

sub callVixen($receiver, $verb, *@args) {
    my $proc = run %*ENV ~ "/call:", $receiver, $verb, |@args, :out;
    my $answer = $proc.out.slurp: :close;
    $proc.sink;
    $answer.trim;
}

Vixen is merely a calling convention for processes; we can send a message to an object by doing some string formatting and running a subprocess. The response to a message, called an answer, is given by stdout. Non-zero return codes indicate failure and stderr will contain useful information for the user. The rest of the calling convention is handled by passing envp and calling the V/call: entrypoint.

In addition to passing V in the environment, we will assume that there are Allocator and NixStore objects. Allocator allocates new objects and NixStore interacts with the Nix package manager; we will allocate a new object and store it in the Nix store. The relevant methods are V/clone: anAllocator, which allocates a shallow copy of V and serves as a blank object template, and NixStore/intern: anObject, which recursively copies an object from a temporary directory into the Nix store.

The reader doesn’t need to know much about Nix. The only relevant part is that the Nix store is a system-wide immutable directory that might not be enumerable; it’s a place to store packages, but it’s hard to alter packages or find a package that the user hasn’t been told about.

Name analysis

We will need to know whether a name is used by a nested block. When we create an object representing that block, we will provide that object with each name that it uses. This is called name-use analysis or just use analysis and it is a type of name analysis. The two effects worth noting are when an expression uses a name and when a statement assigns to a name. We track the used names with a Set[Str]. For example, a keyword message uses a name if any of its arguments use a name:

class VMessage {
    has VKeyword @.keywords;
    method uses(--> Set[Str]) { [(|)] @.keywords.map({ $_.uses }) }
}

A sequence of expressions has its usage computed backwards; every time an expression is assigned to a name, we let that assignment shadow any further uses by removing it from the set of used names. This can be written with reduce but it’s important to preserve readability since this sort of logic can be subtly buggy and often must be revisted during debugging.

class VExprs {
    has EmitStatement @.statements;
    has EmitLiteral $.rv;
    method uses(--> Set[Str]) {
        my $names = $.rv.uses;
        for @.statements.reverse {
            $names = ($names (-) $_.writes) (|) $_.call.uses;
        };
        $names;
    }
}

The .writes method merely produces the set of assigned names:

class VAssignName does EmitStatement {
    has Str $.target;
    method writes(--> Set[Str]) { Set[Str].new($.target) }
}

A second compiler

We now are ready to compile nested blocks. The overall workflow is to compute a closure for the inner block whose names are all used names in the block, except for parameters and global names. We rename everything in the closure with fresh symbols to avoid clashes and allow names like “self” to be closed over. We produce two scripts. One script accepts the closure’s values and attaches them to a new object; one script loads the closure and performs the action in the nested block upon the new object. We call into Vixen to allocate the prototype for the block, populate it, and intern it into the Nix store. Everything else is support code.

        my $closureNames = $uses (-) ($params (|) %globals);
        my %closure = $closureNames.keys.map: { $_ => $compiler.gensym ~ "_" ~ $_ };
        my $valueVerb = @.params ?? "value:" x @.params.elems !! "value";
        my $closureVerb = %closure ?? %closure.keys.map(* ~ ":").join !! "make";
        my $valueBlock = produceValueBlock($compiler, %closure, @.params, $.exprs);
        my $closureBlock = cloneForClosure($compiler, %closure);
        my $obj = callVixen(%*ENV, "clone:", $allocator);
        $compiler.writeBlock: $obj, $valueVerb, $valueBlock;
        $compiler.writeBlock: $obj, $closureVerb, $closureBlock;
        my $interned = callVixen(%*ENV, "intern:", $obj);

One hunk of support code is in the generation of the scripts with produceValueBlock and cloneForClosure. These are open-coded actions against the $compiler object:

sub cloneForClosure($compiler, %closure) {
    my $name = $compiler.genblock;
    $compiler.pushBlock: $name, %closure.keys;
    my $obj = $compiler.gensym;
    my $selfName = $compiler.useName: "self";
    $compiler.callBind: $obj, $selfName, "clone:", ($allocator,);
    my $rv = $compiler.useName: $obj;
    for %closure.kv -> $k, $v {
        my $arg = $compiler.useName: $k;
        $compiler.foreground: "redirfd -w 1 $rv/$v echo " ~ $arg;
    }
    $compiler.finishBlock: $rv;
    $name;
}
sub produceValueBlock($compiler, %closure, @params, $exprs) {
    my $name = $compiler.genblock;
    $compiler.pushBlock: $name, @params;
    my $selfName = $compiler.useName: "self";
    for %closure.kv -> $k, $v { $compiler.callBind: $k, $selfName, $v, [] };
    my $rv = $exprs.compileExprs: $compiler;
    $compiler.finishBlock: $rv;
    $name;
}

The compiler was augmented with methods for managing scopes of names and reassigning names, so that the define helper is no longer used at all. There’s also a method .writeBlock which encapsulates the process of writing out a script to disk.

class Compiler {
    has Hash[Str] @.scopes;
    method assignName($from, $to) { @.scopes[*-1]{ $to } = $from }
    method useName($name) {
        for @.scopes.reverse {
            return "\$\{" ~ $_{ $name } ~ "\}" if $_{ $name }:exists;
        };
        die "Name $name not in scope!";
    }
    method writeBlock($obj, $verb, $blockName) {
        spurt "$obj/$verb", %.lines{ $blockName }.trim-leading;
        chmod 0o755, "$obj/$verb";
    }
}

Closing thoughts

This compiler is less jank than the typical compiler. There’s a few hunks of duplicated code, but otherwise the logic is fairly clean and direct. Raku supports a clean compiler mostly by requiring a grammar and an action class; I had started out by writing imperative spaghetti actions, and it was up to me to decide to organize further. To optimize, it might be worth virtualizing assignments so that there is only one convention for calls; this requires further bookkeeping to not only track renames but also name usage. Indeed, at that point, the reader is invited to consider what SSA might look like. Another possible optimization is to skip allocating empty closures for blocks which don’t close over anything.

It was remarkably easy to call into Vixen from Raku. I could imagine using the FFI as scaffolding to incrementally migrate this compiler to a self-hosting expression-language representation of itself. I could also imagine extending the compiler with FFI plugins that decorate or cross-cut compiler actions.

This blogpost is currently over 400 lines. The full compiler is under 400 lines. We put Raku into a jar with Unix, Smalltalk, and Nix; shook it up, and let it sit for a few days. The result is a humble compiler for a simple expression language with a respectable amount of spice. Thanks to the Raku community for letting me share my thoughts. To all readers, whether you’re Pastafarian or not, whether you prefer red sauce or white sauce or even pesto, I hope that you have a lovely and peaceful Holiday.

Day 15 – An expression language for Vixen

Problem

Syntax

Grammar

Actions

Abstract syntax

A first compiler

Calling into Vixen

Name analysis

A second compiler

Closing thoughts

Published by Corbin

Leave a comment Cancel reply

Problem

Syntax

Grammar

Actions

Abstract syntax

A first compiler

Calling into Vixen

Name analysis

A second compiler

Closing thoughts

Share this:

Related

Published by Corbin

Leave a comment Cancel reply