So, I’m a programmer and I work for a government IT (“e-gov”) department. My work here mostly consists of one-off data-integration tasks (like the one in this chronicle) and programming satellite utilities for our Citizen Relationship Management system.
the problem
So, suppose you have:
a lot of records (half a million) in a .csv file, to be entered into your database;
a database only accessible via a not-controlled-by-you API;
said API takes a little bit more than half a second per record;
some consistency checks must be done before sending the records to the API; but
the API is a “black box” and it may be more strict than your basic consistency checks;
a tight schedule (obviously)
the solution
the prototype: Text::CSV and HTTP::UserAgent
So, taking half a second per record just in the HTTP round trip is bad, very bad (around 34 hours to process the whole dataset).
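In outline, the prototype does something like this (a sketch: read-csv, the endpoint, and the payload shape are illustrative, not the real code):

```raku
use Text::CSV;
use HTTP::UserAgent;
use JSON::Fast;

# read-csv produces one hash per CSV row, lazily
sub read-csv($path) {
    my $csv = Text::CSV.new;
    my $fh  = open $path, :r;
    my @head = @($csv.getline($fh))».Str;           # header row
    # the loop stops when getline returns nothing (end of file)
    lazy gather while my $row = $csv.getline($fh) {
        take %( @head Z=> @$row».Str );
    }
}

my $endpoint = 'https://csm.example/api/records';   # illustrative
my $ua       = HTTP::UserAgent.new;
my ($degree, $batch) = 8, 64;

my @r = read-csv('data.csv');                       # still lazy here

# one POST per record, fanned out over the hyper-sequence,
# with the response codes funnelled through a supply
my $codes = supply {
    for @r.hyper(:$degree, :$batch).map(-> %row {
            $ua.post($endpoint, %( data => to-json(%row) )).code
        }) -> $code {
        emit $code;
    }
}
react whenever $codes -> $code {
    note $code unless $code == 200;
}
```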
So, explaining the code above a little bit: @r is a lazy sequence (meaning, roughly, that the while my $row loop in read-csv is executed one row at a time, in a coroutine-like fashion). Calling .hyper(:$degree, :$batch) turns the sequence into a “hyper-sequence”: it opens a thread pool with $degree threads and feeds each thread $batch items from the original sequence until it is exhausted.
Yeah, but HTTP::UserAgent does not parallelise very nicely (it just does not work)… Besides, why the react whenever supply emit? That is a mystery lost to time. Was it really needed? Probably not, but the clock is always ticking, so just move along.
Cro::HTTP to the rescue
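The idea is to swap HTTP::UserAgent for Cro’s asynchronous client while keeping the same hyper pipeline. A sketch (the base-uri is illustrative; handing Cro a Hash body with a JSON content-type makes it serialise the body for us):

```raku
use Cro::HTTP::Client;

my $client = Cro::HTTP::Client.new:
    base-uri     => 'https://csm.example/api/',   # illustrative
    content-type => 'application/json';

my @statuses = @r.hyper(:$degree, :$batch).map(-> %row {
    # .post returns a Promise; awaiting it inside the hyper worker
    # keeps roughly $degree requests in flight at a time
    my $res = await $client.post('records', body => %row);
    $res.status
});
say @statuses.Bag;                                # summary of outcomes
```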
Nice, but I ran the thing on a testing database and… oh, no… lots of 503s and eventually a 401 and the connection was lost.
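The server is clearly asking us to slow down. So: fewer workers, plus a retry with back-off when a 503 comes in. Something like this (the retry budget and the sleeps are arbitrary choices):

```raku
sub post-row(%row) {
    for 1..5 -> $attempt {                         # retry budget (arbitrary)
        my $res = try await $client.post('records', body => %row);
        return $res.status with $res;
        # only retry plain overload; anything else is a real error
        $!.rethrow unless $! ~~ X::Cro::HTTP::Error
                       && $!.response.status == 503;
        sleep 0.5 * $attempt;                      # linear back-off
    }
    503                                            # exhausted the retries
}

my @statuses = @r.hyper(degree => 4, :$batch)      # gentler than before
               .map(-> %row { post-row(%row) });
```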
Oh, it ran almost to the end of the data (and it’s fast), but… we got some 409s for records where our csv-to-json conversion is not smart enough; those records we can safely ignore. And some timeouts.
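One last tweak, then: a 409 means our csv-to-json missed something for that particular record, so we log it and skip it; a timed-out or dropped request gets retried just like a 503. Again a sketch of the final shape of post-row:

```raku
sub post-row(%row) {
    for 1..5 -> $attempt {
        my $res = try await $client.post('records', body => %row);
        return $res.status with $res;
        given $! {
            when X::Cro::HTTP::Error {
                # 409: this record will never go through; skip it
                return .response.status if .response.status == 409;
                .rethrow unless .response.status == 503;
            }
            default { }    # timeouts and dropped connections: retry too
        }
        sleep 0.5 * $attempt;
    }
    note "giving up on {to-json %row}";
    Nil
}
```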
So now the whole process runs smoothly and finishes in 20 minutes, circa 100x faster.
Import the data in production… similar results. The process is ongoing; 15 minutes in, Ops comes by (in person):
Why is the server load triple the usual, and the number of 5xx responses through the roof?
Just five more minutes; check ticket XXX, I’m closing it now…
(unintelligible noises)
And this is the story of how half a million records, which would otherwise take two whole days to import, were imported in twenty-some minutes. The whole ticket took less than a day’s work, start to finish.
related readings
If you want to read more about Raku concurrency, past Advent articles that might interest you are: