Day 8 – Parsing Firefox’ user.js with Raku (Part 2)

Yesterday, we made a short Grammar that could parse a single line of the user.js that Firefox uses. Today, we’ll be adding a number of test cases to make sure everything we want to match will match properly. Additionally, the Grammar can be expanded to match multiple lines, so we can let the Grammar parse an entire user.js file in a single call.

Adding more tests

To get started with matching other argument types, we should extend the list of
test cases that are defined in MAIN. Let’s add a couple to match true,
false, null and integer values.

my @inputs = (
  'user_pref("browser.startup.homepage", "https://searx.tyil.nl");',
  'user_pref("extensions.screenshots.disabled", true);',
  'user_pref("browser.search.suggest.enabled", false);',
  'user_pref("i.have.no.nulls", null);',
  'user_pref("browser.startup.page", 3);',
);

I would suggest updating the for loop as well, so that it indicates which input it is currently trying to match. Things will fail to match, and it will be easier to see which output belongs to which input if we print the input first.

for @inputs {
  say "\nTesting $_\n";
  say UserJS.parse($_);
}

If you run the script now, you’ll see that only the first test case is actually
working, while the others all fail on the argument. Let’s fix each of these
tests, starting at the top.

Matching other types

To make it easy to match all sorts of types, let’s introduce a proto regex. This will help keep everything in small, manageable blocks. Let’s also rename the argument rule to constant, which more aptly describes the things we’re going to match with it. Before adding new functionality, let’s see what the rewritten structure looks like.

rule argument-list {
  '('
  <( <constant>+ % ',' )>
  ')'
}

proto rule constant { * }

rule constant:sym<string> {
  '"'
  <( <-["]>+? )>
  '"'
}

As you can see, I’ve given the constant the sym adverb named string. This makes it easy for us to see that it’s about constant strings. Now we can also easily add additional constant types, such as booleans.

rule constant:sym<boolean> {
  | 'true'
  | 'false'
}

This will match both the bare words true and false. Adding just this and running the script once more will show you that the next two test cases are now working. Adding the null type is just as easy.

rule constant:sym<null> {
  'null'
}
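To watch the proto regex dispatch in isolation, here is a minimal sketch. The grammar name ConstantDemo is made up for this illustration, and it uses token instead of rule so that whitespace handling stays out of the picture.

```raku
# ConstantDemo is a hypothetical, stripped-down grammar for illustration only.
grammar ConstantDemo {
  token TOP { <constant> }

  proto token constant {*}
  token constant:sym<boolean> { 'true' | 'false' }
  token constant:sym<null>    { 'null' }
}

# Every candidate of the proto is tried; 'maybe' matches none of them.
for 'true', 'false', 'null', 'maybe' -> $input {
  say $input, ' => ', ConstantDemo.parse($input) ?? 'match' !! 'no match';
}
```

Adding another type later is then only a matter of adding one more constant:sym variant, without touching TOP.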

Now all we need to pass the 5th test case is parsing numbers. In JavaScript, every number is a float, so let’s stick to that for our Grammar as well. Let’s accept one or more digits, optionally followed by a dot and another set of digits. Of course, we should also allow a - or a + in front of them.

rule constant:sym<float> {
  <[+-]>? \d+ [ "." \d+ ]?
}
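If you want to poke at the number pattern on its own, a small sketch like the following can help. FloatDemo is a hypothetical name, and the pattern is declared as a token here (an assumption that differs from the rule above) so that no whitespace is accepted between the sign, the digits and the dot.

```raku
# FloatDemo is a hypothetical grammar, only for trying out the number pattern.
grammar FloatDemo {
  token TOP { <[+-]>? \d+ [ '.' \d+ ]? }
}

# The last three inputs are expected to be rejected: the pattern needs
# digits on both sides of the dot, and nothing but digits otherwise.
for '3', '+3', '-12', '1.5', '-0.25', '.5', '1.', 'abc' -> $input {
  my $match = FloatDemo.parse($input);
  say $input, ' => ', $match ?? +$match !! 'no match';
}
```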

Working out some edge cases

It looks like we can match all the important types now. However, there are some allowed edge cases that aren’t going to work yet. A big one is of course a string containing a ". If we add a test case for this, we can see it failing when we run the script.

my @inputs = (
  ...
  'user_pref("double.quotes", "\"my value\"");',
);

To fix this, we need to go back to constant:sym<string>, and alter the rule to take escaped double quotes into account. Instead of only matching characters that are not a ", we can match any characters up to the first " that does not directly follow a \, because a " that follows a \ is escaped.

rule constant:sym<string> {
  '"'
  <( .*? <!after '\\'> )>
  '"'
}
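The lookbehind can be tried out separately with a sketch like this one. StringDemo is a hypothetical name, and the pattern is declared as a plain regex here so that no implicit whitespace matching gets in the way.

```raku
# StringDemo is a hypothetical grammar for illustration only.
# The closing quote is only accepted when it does not directly
# follow a backslash.
grammar StringDemo {
  regex TOP {
    '"'
    <( .*? <!after '\\'> )>
    '"'
  }
}

# Thanks to the <( and )> markers, only the text between the
# outer quotes ends up in the match.
for '"plain"', '"\"my value\""' -> $input {
  say StringDemo.parse($input).Str;
}
```

Note that this simple check treats every backslash as an escape, so a string that ends in an escaped backslash would not be closed correctly. For user.js files that is probably rare enough to accept.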

Parsing multiple lines

Now that it seems we are able to handle all the different user_pref values that Firefox may throw at us, it’s time to update the script to parse a whole file. Let’s move the inputs we have right now to user.js, and update the MAIN subroutine to read that file.

sub MAIN () {
  say UserJS.parse('user.js'.IO.slurp);
}

Running the script now will print a Nil value on STDOUT, but if you still have Grammar::Tracer enabled, you’ll also notice that it has no complaints. It’s all green!

The problem here is that the TOP rule is currently instructed to only parse a single user_pref line, but our file contains multiple such lines. The parse method of the UserJS Grammar expects to match the entire string it is told to parse, and that’s causing the Grammar to ultimately fail.
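The difference is easy to see with a small sketch. SingleLine is a hypothetical, simplified stand-in for our Grammar that accepts exactly one statement. The parse method insists on consuming the whole input, while subparse is happy with a match at the start.

```raku
# SingleLine is a hypothetical stand-in that accepts exactly one statement.
grammar SingleLine {
  rule TOP { 'user_pref(' <-[)]>* ');' }
}

my $one = 'user_pref("a", 1);';
my $two = "user_pref(\"a\", 1);\nuser_pref(\"b\", 2);\n";

say so SingleLine.parse($one);     # True
say so SingleLine.parse($two);     # False, the second line is left over
say so SingleLine.subparse($two);  # True, only the first line is matched
```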

So, we’ll need to alter the TOP rule to allow matching of multiple lines. The easiest way is to wrap the current contents into a group, and add a quantifier to that.

rule TOP {
  [
    <function-name>
    <argument-list>
    ';'
  ]*
}

Now it matches all lines, and correctly extracts the values of the user_pref statements again.

Any comments?

There is another edge case to cover: comments. These are allowed in the user.js file, and the preset configurations you can find online often make extensive use of them. In JavaScript, comments start with // and continue until the end of the line.

We’ll be using a token instead of a rule for this, since a token doesn’t handle whitespace for us. The newline is a whitespace character, and it is significant for a comment because it marks its end. Additionally, the TOP rule needs a small alteration again to accept comment lines as well. To keep things readable, we should move the current contents of the matching group to its own rule.

rule TOP {
  [
  | <user-pref>
  | <comment>
  ]*
}

token comment {
  '//'
  <( <-[\n]>* )>
  "\n"
}

rule user-pref {
  <function-name>
  <argument-list>
  ';'
}

Now you should be able to parse comments as well. It shouldn’t matter whether they are on their own line, or after a user_pref statement.
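As a quick check of both comment positions, here is a sketch with a cut-down grammar. LinesDemo and its pref rule are hypothetical simplifications; the comment token is the same as the one above.

```raku
# LinesDemo is a hypothetical, cut-down grammar for illustration only.
grammar LinesDemo {
  rule TOP { [ | <pref> | <comment> ]* }

  token comment { '//' <( <-[\n]>* )> "\n" }

  # Simplified: anything up to the closing parenthesis counts as arguments.
  rule pref { 'user_pref(' <-[)]>* ');' }
}

my $input = (
  '// Comments are welcome!',
  'user_pref("a", true); // trailing comment',
  'user_pref("b", 3);',
).join("\n") ~ "\n";

my $match = LinesDemo.parse($input);

say $match<pref>.elems;                     # 2
.say for $match<comment>.map(*.Str.trim);   # the text of both comments
```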

Make it into an object

What good is parsing data if you can’t easily play with it afterwards? So, let’s make use of Grammar Actions to transform the Match objects into a list of UserPref objects. First, let’s declare what the class should look like.

class UserPref {
  has $.key;
  has $.value;

  submethod Str () {
    my $value;

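    given ($!value) {
      when Str     { $value = "\"$!value\"" }
      # Bool is an Int enum, so it has to be checked before Numeric
      when Bool    { $value = $!value ?? 'true' !! 'false' }
      # +$/ yields an Int or a Rat, which a check for Num would miss
      when Numeric { $value = $!value }
      when Any     { $value = 'null' }
    }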

    sprintf('user_pref("%s", %s);', $!key, $value);
  }
}
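When writing a given block over types like this, it helps to check how Raku classifies the values first, since both the choice and the order of the when clauses depend on it. The following sketch only uses core types. It shows that numifying the text 3 or 1.5 gives an Int or a Rat rather than a Num, and that a Bool also smartmatches as an Int.

```raku
# Numifying strings gives an Int or a Rat, never a Num for these inputs.
say (+'3').^name;    # Int
say (+'1.5').^name;  # Rat

# So a check against Num misses them, while Numeric covers both.
say 3 ~~ Num;        # False
say 3 ~~ Numeric;    # True
say 1.5 ~~ Numeric;  # True

# Bool is an enum based on Int, so it must be tested before any number check.
say True ~~ Int;     # True
```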

A simple class containing a key and a value, and some logic to turn it back into a string usable in the user.js file. Next, we create an Action class to make these objects. An Action class is like any regular class. All you need to pay attention to is to name the methods the same as the rules used in the Grammar.

class UserJSActions {
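  method TOP ($/) {
    # Each <user-pref> match holds its two constants under <argument-list>
    make $<user-pref>.map({
      UserPref.new(
        key => $_<argument-list><constant>[0].made,
        value => $_<argument-list><constant>[1].made,
      )
    })
  }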

  method constant:sym<boolean> ($/) {
    make (~$/ eq 'true' ?? True !! False)
  }

  method constant:sym<float> ($/) {
    make +$/
  }

  method constant:sym<null> ($/) {
    make Any
  }

  method constant:sym<string> ($/) {
    make ~$/
  }
}

The value methods convert the values as seen in the user.js to Raku types. The TOP method maps over all the user_pref statements that have been parsed, and turns each of them into a UserPref object. Now all that is left is to add the UserJSActions class as the Action class for the parse call in MAIN, and use its made value.

sub MAIN () {
  my $match = UserJS.parse('user.js'.IO.slurp, :actions(UserJSActions));

  say $match.made;
}

Now we can also do things with it. For instance, we can sort all the user_pref statements alphabetically.

sub MAIN () {
  my $match = UserJS.parse('user.js'.IO.slurp, :actions(UserJSActions));
  my @prefs = $match.made;

  for @prefs.sort(*.key) {
    .Str.say
  }
}

Sorting alphabetically may be a bit boring, but you have all sorts of possibilities now, such as filtering out certain options or comments, or merging in multiple files from multiple sources.
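As a rough sketch of the filtering and merging idea: the Pref class below is a hypothetical stand-in for UserPref, and the two lists stand in for the made values of two parsed files. Later entries override earlier ones with the same key.

```raku
# Pref is a hypothetical stand-in for the UserPref class.
class Pref {
  has $.key;
  has $.value;
}

# Pretend these came from two different user.js files.
my @base = (
  Pref.new(key => 'browser.startup.page', value => 3),
  Pref.new(key => 'extensions.screenshots.disabled', value => True),
);
my @extra = (
  Pref.new(key => 'browser.startup.page', value => 1),
  Pref.new(key => 'i.have.no.nulls', value => Any),
);

# Merging: index by key, so the last file to set a key wins.
my %merged;
%merged{.key} = $_ for |@base, |@extra;

# Filtering: keep only the browser.* preferences.
my @browser = %merged.values.grep(*.key.starts-with('browser.'));

say %merged.keys.sort.join(', ');
say @browser.map(*.key).join(', ');
```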

I hope this has been an interesting journey into parsing a whole other programming language using Raku’s extremely powerful Grammars!

The complete code

parser.pl6

class UserPref {
  has $.key;
  has $.value;

  submethod Str () {
    my $value;

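    given ($!value) {
      when Str     { $value = "\"$!value\"" }
      # Bool is an Int enum, so it has to be checked before Numeric
      when Bool    { $value = $!value ?? 'true' !! 'false' }
      # +$/ yields an Int or a Rat, which a check for Num would miss
      when Numeric { $value = $!value }
      when Any     { $value = 'null' }
    }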

    sprintf('user_pref("%s", %s);', $!key, $value);
  }
}

class UserJSActions {
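  method TOP ($/) {
    # Each <user-pref> match holds its two constants under <argument-list>
    make $<user-pref>.map({
      UserPref.new(
        key => $_<argument-list><constant>[0].made,
        value => $_<argument-list><constant>[1].made,
      )
    })
  }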

  method constant:sym<boolean> ($/) {
    make (~$/ eq 'true' ?? True !! False)
  }

  method constant:sym<float> ($/) {
    make +$/
  }

  method constant:sym<null> ($/) {
    make Any
  }

  method constant:sym<string> ($/) {
    make ~$/
  }
}

grammar UserJS {
  rule TOP {
    [
    | <user-pref>
    | <comment>
    ]*
  }

  token comment {
    '//' <( <-[\n]>* )> "\n"
  }

  rule user-pref {
    <function-name>
    <argument-list>
    ';'
  }

  rule function-name {
    'user_pref'
  }

  rule argument-list {
    '('
    <( <constant>+ % ',' )>
    ')'
  }

  proto rule constant { * }

  rule constant:sym<string> {
    '"'
    <( .*? <!after '\\'> )>
    '"'
  }

  rule constant:sym<boolean> {
    | 'true'
    | 'false'
  }

  rule constant:sym<null> {
    'null'
  }

  rule constant:sym<float> {
    <[+-]>? \d+ [ "." \d+ ]?
  }
}

sub MAIN () {
  my $match = UserJS.parse('user.js'.IO.slurp, :actions(UserJSActions));
  my @prefs = $match.made;

  for @prefs.sort(*.key) {
    .Str.say
  }
}

user.js

// Comments are welcome!

user_pref("browser.startup.homepage", "https://searx.tyil.nl");
user_pref("extensions.screenshots.disabled", true); //uwu
user_pref("browser.search.suggest.enabled", false);
user_pref("i.have.no.nulls", null);
user_pref("browser.startup.page", +3);
user_pref("double.quotes", "\"my value\"");
