Day 17 – Writing some horrible Raku code this Christmas!

Santa only had a few days left to make sure everything was ready to go, but with all the stress of the season, he needed a break to recharge. He grabbed some cocoa and hid away in a nook to relax his body and distract his mind, tuning into one of his favorite YouTubers, Matt Parker.

Parker finds interesting mathematical problems that he attempts to untangle and present to the audience in a tractable way, and as he analyzes the problems, he often has to write “some horrible Python code.” Santa, of course, will use his favorite language instead: the Raku Programming Language!

Maybe if Santa’s brain was working on one of these puzzles, it’d help him stop thinking about all the other work he was supposed to be doing.

The Problem

So, what to work on in this precious downtime? Santa wants to work on something a little practical, so he doesn’t feel too guilty about taking some time off – let’s figure out how much we’re going to have to expand the shop in the next few years!

A quick Google search gets us to some UN data – surely that’s a good start. Santa creates a sandbox folder, and manually downloads and unzips the data file. For small projects like this, Santa likes to attack the problem in chunks rather than map the whole project at once. First, he makes sure he can read the data at all:

my $data-file  = "data.csv".IO;
my $data = $data-file.lines;
my $headers = $data[0];
dd $headers.split(',');
("SortOrder", "LocID", "Notes", "ISO3_code", "ISO2_code", "SDMX_code", "LocTypeID", "LocTypeName", "ParentID", "Location", "VarID", "Variant", "Time", "MidPeriod", "AgeGrp", "AgeGrpStart", "AgeGrpSpan", "PopMale", "PopFemale", "PopTotal").Seq

Alright, the CSV starts with a row of headers, so we read it in, grab the first row, and do a data dump of that row. We ignore all the possible complexity of CSV; we’ll deal with that if we need to.
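
As an aside – and this is just a sketch, not part of Santa’s script – if you ever want to double-check which index a given header sits at without counting by hand, you can flip the header list into name => index pairs:

my @names = $headers.split(',');
my %col   = @names.antipairs;            # header name => column index
say %col<Time AgeGrpStart PopTotal>;     # (12 15 19)

Those three columns are the ones we’ll lean on below.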

Filtering

We are only interested in getting estimates on kids, so let’s filter through the data. Santa can ignore anything where the starting age is 15 or higher, at least for this project.

We peeked at the headers, so we know which columns hold the data we need, and we’ll hardcode them for now. Santa grabs the age first since that’s our filter, and only prints out the data if the row is good!

my $data-file  = "data.csv".IO;
my $data = $data-file.lines;
my $headers = $data[0];
my $count = 0;
for @($data) -> $line {
    $count++;
    next if $count == 1; # skip the headers
    my @row = $line.split(',');

    my $age  = @row[15];
    next if $age >= 15;
    my $year = @row[12];
    my $pop  = @row[19];
    dd $year, $age, $pop;
}
Cannot convert string to number: imaginary part of complex number
must be followed by 'i' or '\i' in '0-4⏏' (indicated by ⏏)

What? There’s imaginary numbers in here? Santa adds some debug output to print the line before processing it, and sees:

15,934,g,,,,5,Development group,902,"Less developed regions, excluding least developed countries",2,Medium,1950,1950,0-4,0,5,113433.383,107834.33,221267.713

Not so simple

Ah, biscuits. Looks like our horribly simple start has caught up with us: we do have to care about more complicated CSV data after all. That quoted Location value has a comma inside it, so a naive split shifts every later column over by one, and @row[15] ends up holding the AgeGrp value “0-4” instead of AgeGrpStart.

Rather than spending any more time on improving our CSV “parser” (currently only split), let’s get out the big hammer:

$ zef install Text::CSV

Santa quickly checks out the docs and updates his code:

use Text::CSV;

my $csv = Text::CSV.new;
my $io = open "data.csv", :r;

my @headers = $csv.header($io).column-names;

while (my @row = $csv.getline($io)) {
    my $age = @row[15];
    next if $age >= 15;
    my $year = @row[12];
    my $pop  = @row[19];
}

He’s still using column numbers, but now that he’s switched over to Text::CSV, at least we can process the whole file.

Speed?

The problem with this version is that it’s a little slow. To be fair, it is over 900,000 lines with 20 columns of CSV data. Santa is willing to cheat a little here: he’s just looking for some estimates, after all.

Maybe Text::CSV has to do enough extra processing per line that it adds up, or maybe Raku’s default line iteration is more efficient than manually calling getline a bunch of times.

We’re impatient, so we’ll try addressing both at once: .lines to walk through the file, and only reaching for the CSV parser when the column count tells us a simple split wasn’t enough. We may miss a line or two, but this is good enough for our rough estimate. Santa adds up all the data for each year and prints out some samples.

use Text::CSV;

my $csv = Text::CSV.new;

my @lines = "data.csv".IO.lines;
my $headers = @lines.shift.split(',');
my $cols = $headers.elems;

my %estimate;
for @lines {
   my @row = $_.split(','); # simple CSV
   if @row.elems != $cols {
       @row = $csv.getline($_); # real CSV
   }
   my $year = @row[12];
   next if $year <= 2023;
   my $age = @row[15];
   next if $age >= 15;
   my $pop = @row[19];
   %estimate{$year}+=$pop; 
}
say %estimate{2024};
say %estimate{2050};
19110349.077
19204147.428

Ah, much better. Now we can see that we can expect a few more deliveries in 2050! Let’s improve the formatting a little, filter the output down to one row per decade, and see how much we need to expand!

Pretty print

use Text::CSV;

my $csv = Text::CSV.new;

my @lines = "data.csv".IO.lines;
my $headers = @lines.shift.split(',');
my $cols = $headers.elems;

my %estimate;
for @lines {
   my @row = $_.split(','); # simple CSV
   if @row.elems != $cols {
       @row = $csv.getline($_); # real CSV
   }
   my $year = @row[12];
   next if $year <= 2023;
   next unless $year %% 10; 
   my $age = @row[15];
   next if $age >= 15;
   my $pop = @row[19];
   %estimate{$year} += $pop; 
}

for %estimate.keys.sort -> $year {
    say "$year: %estimate{$year}.fmt('%i')";
}
2030: 18838469
2040: 18926239
2050: 19204147
2060: 18816096
2070: 18281171
2080: 17819389
2090: 17111136
2100: 16315984

Oh! It’s a good thing we checked – it looks like 2050 will be the peak, and then the projections go back down! Maybe we can avoid expanding the shop for a while!

Speed?

Even though we have our answer now, this still takes a few seconds to get through all the data, so one last round of changes! We can:

  • add some concurrency to race through the processing – we don’t care what order we process the data in
  • use some Seq methods to deal with the first line of headers more cleanly
  • specify a type for the data we’re extracting
  • use a Mix instead of a Hash to handle the addition
  • change the logic a bit to grab all the data and only print what we want – makes it easier if we want to change our reporting later

use Text::CSV;

my $io      = "data.csv".IO;
my $headers = $io.lines.head.split(',');
my $cols    = $headers.elems;

my %estimate is Mix = $io.lines.skip.race(batch => 1024).map: {
    my @row = .split(','); # simple CSV
    if @row.elems != $cols {
        @row = Text::CSV.new.getline($_); # real CSV
    }
    my int $year = @row[12].Int;
    my int $age  = @row[15].Int;
    my int $pop  = @row[19].Int;
    $year => $pop if $year > 2023 && $age < 15;
}
for %estimate.keys.grep(* %% 10).sort -> $year {
    say "$year: %estimate{$year}.fmt('%i')";
}
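
The Mix assignment deserves a quick note: when a Mix is built from a list of pairs, pairs that share a key have their weights added together, which is exactly the summing our old %estimate{$year} += $pop loop was doing by hand. A tiny standalone example (not part of Santa’s script):

my %m is Mix = 2050 => 3, 2050 => 4, 2060 => 1;
say %m{2050};   # 7 - duplicate keys have their weights summed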

This does a little more work in about 40% of the time of the previous version since Santa made the work happen on multiple cores!
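
If you want to sanity-check a speedup claim like that, one rough way (not necessarily how Santa measured) is to bracket the work with now, or simply run the script under the shell’s time command:

my $start = now;
# ... all the parsing and summing goes here ...
note "elapsed: {now - $start} seconds";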

Other improvements?

Having gotten the quick answer he was looking for, Santa throws together a TODO file for next year’s estimator script:

  • Pull the file from the UN and unzip it in code if we haven’t already – and see if there’s an updated file name each year (a rough sketch follows after this list)
  • Switch to a full Text::CSV version and figure out the best API to use for parallel processing. If we ever get embedded newlines in this CSV file, our cheat won’t work!
  • Use column headers instead of numbers to future proof against changes in the data file!
  • Wrap this into a MAIN sub so we can pass in the config we have hardcoded in the script (see the second sketch below)
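
For that first item, here’s a rough sketch of what the fetch-and-unzip step could look like – the URL is just a placeholder, since the real UN file name needs checking each year:

# Hypothetical sketch: only fetch and unzip if we don't already have the data.
# The URL is a placeholder - the actual UN download location changes over time.
my $url = 'https://example.org/population-estimates.zip';
unless 'data.csv'.IO.e {
    run 'curl', '-L', '-o', 'data.zip', $url;   # download the archive
    run 'unzip', '-o', 'data.zip';              # extract data.csv next to the script
}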
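
And for the last item, a minimal sketch of the MAIN wrapper – the option names are made up here, but the defaults are the values currently hardcoded in the script:

# Hypothetical MAIN wrapper: file name, age cutoff, and starting year become
# command-line options, defaulting to what the script hardcodes today.
sub MAIN(Str :$file = 'data.csv', Int :$max-age = 15, Int :$after-year = 2023) {
    say "Processing $file for ages under $max-age, years after $after-year";
    # ... the existing estimate code would move in here ...
}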

Wrapup

Now that Santa’s exercised his brain on this code, he’s ready to get back to the real work for the season!

Santa’s recommendation to you is to write some “horrible” Raku code, just like Matt Parker would. Of course, it’s not actually horrible, more “quick and dirty”. Remember, it’s OK to write something that just gets the job done rather than starting with something polished.

It’s OK if you don’t necessarily understand all the nuances of the language (it’s big!); you just need enough to get the job done. You can always go back later and polish or iteratively improve it.

Raku even has this attitude baked in with gradual typing – you can add type strictures as you need. Much like writing a blog post, it’s easier to start with something and revise it than it is to face that blank file.

Remember, when optimizing your project, sometimes it’s OK to optimize for developer time!
