Day 22 – The Magic Of hyper

After reading Santa’s horrible code, Lizzybel thought it might be time to teach the elves (and maybe Santa) a little bit about hypering and racing in the Raku Programming Language.

So they checked with all the elves that were in their latest presentation, to see if they would be interested in that. “Sure”, one of them said, “anything that will give us some more free time, instead of having to wait for the computer!” So Lizzybel checked the presentation room (still empty) and rounded up the elves that could make it. And started:

A simple case

Since we’re going to mostly talk about (wallclock) performance, I will add timing information as well (for what it’s worth: on an M1 (ARM) processor that does not support JITting).

Let’s start with a very simple piece of code, to understand some of the basics: please give me the 1 millionth prime number!

$ time raku -e 'say (^Inf).grep(*.is-prime)[999999]'
15485863
real 6.41s
user 6.40s
sys 0.04s

Looks like 15485863 is the 1 millionth prime number! What we’re doing here is taking the infinite list of 0 and all positive integers (^Inf), select only the ones that are prime numbers (.grep(*.is-prime)), put the selected ones in a hidden array and show the 1 millionth element ([999999]).
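On a smaller scale the same pipeline is easy to verify by hand (a quick sketch, not one of the original examples):

```raku
# Same pipeline, small scale: the 10th prime (index 9)
say (^Inf).grep(*.is-prime)[9];  # 29
```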

As you can see, that takes 6.41 seconds (wallclock). Now one thing that is happening here, is that the first 999999 prime numbers are also saved in an array. So that’s a pretty big array. Fortunately, you don’t have to do that. You can also skip the first 999999 prime numbers, and then just show the next value, which would be the 1 millionth prime number:

$ time raku -e 'say (^Inf).grep(*.is-prime).skip(999999).head'
15485863
real 5.38s
user 5.39s
sys 0.02s

This was already noticeably faster: 5.38s, about 16% less wallclock time! If you’re looking for performance, always look for work that is being done needlessly!

This was still only using a single CPU. Most modern computers have more than one CPU. Why not use them? This is where the magic of .hyper comes in. It allows you to run a piece of code on multiple CPUs in parallel:

$ time raku -e 'say (^Inf).hyper.grep(*.is-prime).skip(999999).head'
15485863
real 5.01s
user 11.70s
sys 1.42s

Well, that’s… disappointing? Only marginally faster (5.38 -> 5.01)? And using more than twice as much CPU time (5.39 -> 11.70)?

The reason for this: overhead! The .hyper adds quite a bit of overhead to the execution of the condition (.is-prime). You could think of .hyper.grep(*.is-prime) as .batch(64).map(*.grep(*.is-prime).Slip).

In other words: create batches of 64 values, filter out the prime numbers in that batch and slip them into the final sequence. That’s quite a bit of overhead compared to just checking each value for prime-ness. And it shows:

$ time raku -e 'say (^Inf).batch(64).map(*.grep(*.is-prime).Slip).skip(999999).head'  
15485863
real 9.55s
user 9.55s
sys 0.03s

That’s almost twice as slow as the non-batched version (5.38 -> 9.55).

Now you might ask: why the 64? Wouldn’t it be better if that were a larger value? Like 4096 or so? Indeed, that makes things a lot better:

$ time raku -e 'say (^Inf).batch(4096).map(*.grep(*.is-prime).Slip).skip(999999).head'   
15485863
real 7.75s
user 7.75s
sys 0.03s

The 64 is the default value of the :batch parameter of the .hyper method. If we apply the same change in size to the .hyper case, the result is a lot better indeed:

$ time raku -e 'say (^Inf).hyper(batch => 4096).grep(*.is-prime).skip(999999).head'          
15485863
real 1.22s
user 2.79s
sys 0.06s

That’s more than 5x as fast as the original time we had. Now, should you always use a bigger batch size? No, it all depends on the amount of work that needs to be done. Let’s take this extreme example, where sleep is used to simulate work:

$ time raku -e 'say (^5).map: { sleep $_; $_ }'   
(0 1 2 3 4)
real 10.18s
user 0.14s
sys 0.03s

Because all of the sleeps are executed consecutively, this obviously takes 10+ seconds, because 0 + 1 + 2 + 3 + 4 = 10. Now let’s add a .hyper to it:

$ time raku -e 'say (^5).hyper.map: { sleep $_; $_ }'
(0 1 2 3 4)
real 10.19s
user 0.21s
sys 0.05s

That didn’t make a lot of difference, because all 5 values of the Range are slurped into a single batch because the default batch size is 64. These 5 values are then processed, causing the sleeps to still be executed consecutively. So this is just adding overhead. Now, if we change the batch size to 1:

$ time raku -e 'say (^5).hyper(batch => 1).map: { sleep $_; $_ }'
(0 1 2 3 4)
real 4.19s
user 0.19s
sys 0.03s

Because this way all of the sleeps are done in parallel, we only need to wait just over 4 seconds for this to complete, because that’s the longest sleep that was done.

Question: what would the wallclock time at least be in the above example if the size of the batch would be 2?

In an ideal world the batch-size would adjust itself automatically to provide the best overhead / throughput ratio. But alas, that’s not the case yet. Maybe in a future version of the Raku Programming Language!

Racing

“But what about the .race method?”, asked one of the elves, “what’s the difference between .hyper and .race?” Lizzybel answered:

The .hyper method guarantees that the result of the batches are produced in the same order as the order in which the batches were received. The .race method does not guarantee that, and therefore has slightly less overhead than .hyper.
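A minimal sketch of that difference (not from the original presentation; the output of the .race version depends on thread scheduling):

```raku
# .hyper preserves input order in the result
say (^20).hyper(batch => 1).map(* ** 2).head(5);  # (0 1 4 9 16)

# .race makes no ordering promise: the squares may come out shuffled
say (^20).race(batch => 1).map(* ** 2).head(5);
```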

You typically use .race if you put the result into a hash-like structure, like Santa did in the end, because hash-like structures (such as a Mix) don’t have a particular order anyway.
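For example (an illustrative sketch, with an arbitrarily picked batch size), tallying prime-ness into a Bag, where order is irrelevant anyway:

```raku
# Count primes below 1000 without caring about result order
say (^1000).race(batch => 100).map(*.is-prime).Bag;
# there are 168 primes below 1000, so: False => 832, True => 168
```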

Degree

“But what if I don’t want to use all of the CPUs?” another elf asked. “Ah yes, almost forgot to mention that,” Lizzybel mumbled, and continued:

By default, .hyper and .race will use all but one of the available CPUs. The reason for this is that you need one CPU to manage the batching and reconstitution. But you can specify any other value with the :degree named argument, and the results may vary:

$ time raku -e 'say (^Inf).hyper(batch => 4096, degree => 64).grep(*.is-prime).skip(999999).head'
15485863
real 1.52s
user 3.08s
sys 0.06s

That’s clearly slower (1.22 -> 1.52) and takes more CPU (2.79 -> 3.08).

That’s because the :degree argument indicates the maximum number of worker threads that will be started. If that number exceeds the number of physical CPUs, you just give the operating system more to do, shuffling the workload of many threads around on a limited number of CPUs.
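A sketch of what :degree does in practice (timings are indicative only): four one-second sleeps with only 2 worker threads need two rounds, so roughly 2 seconds of wallclock time:

```raku
# batch => 1 gives each worker one value; degree => 2 caps us at 2 workers,
# so the four 1-second sleeps run in two parallel rounds (~2s wallclock)
say (^4).hyper(batch => 1, degree => 2).map: { sleep 1; $_ };  # (0 1 2 3)
```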

Work

At that point Santa, slightly red in the face, opened the door of the presentation room and bellowed: “Could we do all of this after Christmas, please? We’re running behind schedule!” All of the elves quickly left and went back to work.

The answer is: 5. Because the first batch (0 1) will sleep 0 + 1 = 1 second, the second batch (2 3) will sleep 2 + 3 = 5 seconds, and the final batch only has 4, so will sleep for 4 seconds.
