Day 22: He’s making a list… (part 1)

If there’s anything that Santa and his elves ought to know, it’s how to make a list. After all, they’re reading lists that children send in, and Santa maintains his very famous list. Another thing we know is that Santa and his elves are quite multilingual.

So one day one of the elves decided that, rather than hand typing out a list of gifts based on the data they received (requiring elves that spoke all the world’s languages), they’d take advantage of the power of Unicode’s CLDR (Common Locale Data Repository). This is one of Unicode’s lesser-known projects. As luck would have it, Raku has a module providing access to the data, called Intl::CLDR. The elf decided that he could probably use some of the data in it to automate their list formatting.

He began by installing Intl::CLDR and playing around with it in the terminal. The module was designed to allow some degree of exploration in a REPL, so the elf did the following after reading the provided README:

# Repl response
use Intl::CLDR; # Nil
my $english = cldr<en> # [CLDR::Language: characters,context-transforms,
# dates,delimiters,grammar,layout,list-patterns,
# locale-display-names,numbers,posix,units]

The module loaded up the data for English and the object returned had a neat gist that provides information about the elements it contains. For a variety of reasons, Intl::CLDR objects can be referenced either as attributes or as keys. Most of the time, the attribute reference is faster in performance, but the key reference is more flexible (because let’s be honest, $english{$foo} looks nicer than $english."$foo"(), and it also enables listy assignment via e.g. $english<grammar numbers>).

In any case, the elf saw that one of the data points is list-patterns, so he explored further:

# Repl response
$english.list-patterns; # [CLDR::ListPatterns: and,or,unit]
$english.list-patterns.and; # [CLDR::ListPattern: narrow,short,standard]
$english.list-patterns.and.standard; # [CLDR::ListPatternWidth: end,middle,start,two]
$english.list-patterns.and.standard.start; # {0}, {1}
$english.list-patterns.and.standard.middle; # {0}, {1}
$english.list-patterns.and.standard.end; # {0}, and {1}
$english.list-patterns.and.standard.two; # {0} and {1}

Aha! He found the data he needed.

List patterns are catalogued by their function (and-ing them, or-ing them, and a unit one designed for formatting conjoined units such as 2ft 1in or similar). Each pattern has three different lengths. Standard is what one would use most of the time, but if space is a concern, some languages might allow for even slimmer formatting. Lastly, each of those widths has four forms. The two form combines, well, two elements. The other three are used to collectively join three or more: start combines the first and second element, end combines the penultimate and final element, and middle combines all second to penultimate elements.
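
How the four forms fit together can be sketched by filling in the placeholders directly. This is only a minimal illustration: the English patterns are hard-coded from the REPL output above rather than read from Intl::CLDR, and apply-pattern and join-list are hypothetical helper names, not part of the module.

```raku
# English "and"/standard patterns, copied by hand from the REPL output above.
my %en =
    start  => '{0}, {1}',
    middle => '{0}, {1}',
    end    => '{0}, and {1}',
    two    => '{0} and {1}';

# Hypothetical helper: fill {0} and {1} of one pattern in a single pass.
sub apply-pattern(Str $pattern, Str() $a, Str() $b --> Str) {
    $pattern.subst(/ '{' (<[01]>) '}' /, -> $m { ($a, $b)[+$m[0]] }, :g)
}

# Hypothetical helper: combine any number of items using the four forms.
sub join-list(+@items --> Str) {
    return ''                                if @items == 0;
    return ~@items[0]                        if @items == 1;
    return apply-pattern(%en<two>, |@items)  if @items == 2;

    # Three or more: 'end' joins the last two items, 'middle' prepends each
    # inner item working backwards, and 'start' attaches the first item.
    my $result = apply-pattern(%en<end>, @items[*-2], @items[*-1]);
    for @items[1 .. *-3].reverse -> $item {
        $result = apply-pattern(%en<middle>, $item, $result);
    }
    apply-pattern(%en<start>, @items[0], $result)
}

say join-list <a b>;        # a and b
say join-list <a b c d>;    # a, b, c, and d
```

Because each pattern is applied whole, this approach would also cope with a language whose patterns put text before {0} or after {1}.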

He then wondered what this might look like for other languages. Thankfully, testing this out in the repl was easy enough:

my &and-pattern = { cldr{$^language}.list-patterns.and.standard<start middle end two>.join: "\t" }
# Repl response (RTL corrected, s/\t/' '+/)
and-pattern 'es' # {0}, {1} {0}, {1} {0} y {1} {0} y {1}
and-pattern 'ar' # ‮{0} و{1} {0} و{1} {0} و{1} {0} و{1}
and-pattern 'ko' # {0}, {1} {0}, {1} {0} 및 {1} {0} 및 {1}
and-pattern 'my' # {0} - {1} {0} - {1} {0}နှင့် {1} {0}နှင့် {1}
and-pattern 'th' # {0} {1} {0} {1} {0} และ{1} {0}และ{1}

He quickly saw that there was quite a bit of variation! Thank goodness someone else had already catalogued all of this for him. So he went about trying to create a simple formatting routine. To begin, he created a very detailed signature and then imported the modules he’d need.

#| Lengths for list format. Valid values are 'standard', 'short', and 'narrow'.
subset ListFormatLength of Str where any(<standard short narrow>);
#| Types for list format. Valid values are 'and', 'or', and 'unit'.
subset ListFormatType of Str where any(<and or unit>);
use User::Language; # obtains default languages for a system
use Intl::LanguageTag; # use standardized language tags
use Intl::CLDR; # accesses international data
#| Formats a list of items in an internationally-aware manner
sub format-list(
+@items, #= The items to be formatted into a list
LanguageTag() :$language = user-language, #= The language to use for formatting
ListFormatLength :$length = 'standard', #= The formatting width
ListFormatType :$type = 'and' #= The type of list to create
) {
...
...
...
}

That’s a bit of a big bite, but it’s worth taking a look at. First, the elf opted to use declarator POD wherever possible. This can really help out people who might want to use his eventual module in an IDE, for autogenerating documentation, or for curious users in the REPL. (If you type in ListFormatLength.WHY, the text “Lengths for list format … and ‘narrow’” will be returned.) For those unaware of declarator POD, you can use either #| to apply a comment to the following symbol declaration (in the example, for the subset and the sub itself), or #= to apply it to the preceding symbol declaration (most common with attributes).

Next, he imported two modules that will be useful. User::Language detects the system language, and he used it to provide sane defaults. Intl::LanguageTag is one of the most fundamental modules in the international ecosystem. While he wouldn’t strictly need it (we’ll see he’ll ultimately only use them in string-like form), it helps to ensure at least a plausible language tag is passed.

If you’re wondering what the +@items means, it applies a DWIM logic to the positional arguments. If one does format-list @foo, presumably the list is @foo, and so @items will be set to @foo. On the other hand, if someone does format-list $foo, $bar, $xyz, presumably the list isn’t $foo, but all three items. Since the first item isn’t a Positional, Raku assumes that $foo is just the first item and the remaining positional arguments are the rest of the items. The extra () in LanguageTag() means that it will take either a LanguageTag or anything that can be coerced into one (like a string).
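
Both behaviours are easy to try in isolation. The sketch below uses two hypothetical subs, count-items and double, which are not part of the elf's module, and it substitutes Int() for LanguageTag() so that it runs without any modules installed.

```raku
# Hypothetical sub that only reports how many items it received.
sub count-items(+@items) { @items.elems }

my @gifts = <doll train puzzle>;
say count-items(@gifts);             # 3 (the array itself becomes the list)
say count-items('doll', 'train');    # 2 (each argument is one item)
say count-items(@gifts, 'kite');     # 2 (the array is now a single item)

# A coercion type: accept an Int, or anything that can be coerced into one.
sub double(Int() $n) { $n * 2 }
say double('21');                    # 42
```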

Okay, so with that housekeeping stuff out of the way, he got to coding the actual formatting, which is devilishly simple:

my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>;
if @items > 2 { ... }
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items.head }
else { '' }

He paused here to check whether things worked so far. So he added the following tests and ran his script:

# output
format-list <>, :language<en>; # ''
format-list <a>, :language<en>; # 'a'
format-list <a b>, :language<en>; # 'a{0} and {1}b'

While the simplest two cases were easy, the first one to use CLDR data didn’t work quite as expected. The elf realized he’d need to actually replace the {0} and {1} with the items. While technically he should use subst or similar, after going through the CLDR data, he realized that all of the patterns begin with {0} and end with {1}. So he cheated and changed the initial assignment line to

my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3, *-3);
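
The clipping trick can be checked on its own. This is a small sketch with the English two pattern hard-coded. The guard is a hypothetical safety check added for illustration, not something the elf wrote.

```raku
my $two = '{0} and {1}';

# Guard the assumption that the pattern starts with {0} and ends with {1}.
die "unexpected pattern: $two"
    unless $two.starts-with('{0}') && $two.ends-with('{1}');

# A WhateverCode as the length receives the number of characters left after
# the start offset, so *-3 stops three characters short of the end.
my $glue = $two.substr(3, *-3);
say "a" ~ $glue ~ "b";    # a and b
```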

Now his two-item function worked well. For the three-or-more condition though, he had to think a bit harder about how to combine things. There are actually quite a few different ways to do it! The simplest way for him was to take the first item, then the $start combining text, then join the second through penultimate, and then finish off with the $end and final item:

if @items > 2 {
    @items[0]
    ~ $start
    ~ @items[1..*-2].join($middle)
    ~ $end
    ~ @items[*-1]
}
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items.head }
else { '' }

Et voilà! His formatting function was ready for prime-time!

# output
format-list <>, :language<en>; # ''
format-list <a>, :language<en>; # 'a'
format-list <a b>, :language<en>; # 'a and b'
format-list <a b c>, :language<en>; # 'a, b, and c'
format-list <a b c d>, :language<en>; # 'a, b, c, and d'

Perfect! Except for one small problem. When they actually started using this, the computer systems overheated and melted some of the snow away. Every single time they called the function, the CLDR database needed to be queried and the strings needed to be clipped. The elf had to come up with something a bit more efficient.

He searched high and wide for a solution, and eventually found himself in the dangerous lands of Here Be Dragons™, otherwise known in Raku as EVAL. He knew that EVAL could potentially be dangerous, but that for his purposes, he could avoid those pitfalls. What he would do is query CLDR just once, and then produce a compilable code block that would do the simple logic based on the number of items in the list. The string values could probably be hard coded, sparing some variable look ups too.

There be dragons here 🐉🦋

EVAL should be used with great caution. All it takes is one errant unescaped string being accepted from an unknown source and your system could be taken over. This is why it requires you to affirmatively type use MONKEY-SEE-NO-EVAL in a scope that needs EVAL. However, in situations like this, where we control all inputs going in, things are much safer. In tomorrow’s article, we’ll discuss ways to do this in an even safer manner, although it adds a small degree of complexity.

Back to the regularly scheduled program

To begin, the elf imagined his formatting function.

sub format-list(+@items) {
if @items > 2 { @items[0] ~ $start ~ @items[1..*-2].join($middle) ~ $end ~ @items[*-1] }
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items[0] }
else { '' }
}

That was … really simple! But he needed this in a string format. One way to do that would be to just use straight string interpolation, but he decided to use Raku’s equivalent of a heredoc, q:to. For those unfamiliar, in Raku, quotation marks are actually just a form of syntactic sugar to enter into the Q (for quoting) sublanguage. Using quotation marks, you only get a few options: ' ' means no escaping except for \\, and using " " means interpolating blocks and $-sigiled variables. If we manually enter the Q-language (using q or Q), we get a LOT more options. If you’re more interested in those, you can check out Elizabeth Mattijsen’s 2014 Advent Calendar post on the topic. Our little elf decided to use the q:s:to option to enable him to keep his code as is, with the exception of having scalar variables interpolated. (The rest of his code only used positional variables, so he didn’t need to escape!)

my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>;
my $code = q:s:to/FORMATCODE/;
sub format-list(+@items) {
if @items > 2 { @items[0] ~ $start ~ @items[1..*-2].join($middle) ~ $end ~ @items[*-1] }
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items[0] }
else { '' }
}
FORMATCODE
EVAL $code;
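
What q:s does and does not interpolate can be seen by printing the generated text instead of compiling it. This is a minimal sketch with a hard-coded $glue standing in for the CLDR text. The heredoc never needs @items to exist, because under :s only scalars are interpolated.

```raku
my $glue = ' and ';

# :s interpolates $-sigiled variables only; @items[...] is left untouched.
my $code = q:s:to/END/;
    @items[0] ~ $glue ~ @items[1]
    END

print $code;    # @items[0] ~  and  ~ @items[1]
```

The printed line is not valid Raku, since the glue text lands in the code without quotes around it.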

The only small catch is that he’d need to get a slightly different version of the text from CLDR. If the text " and " were placed verbatim (without its quotes) where $two is, that block would end up being @items[0] ~  and  ~ @items[1], which would cause a compile error. Luckily, Raku has a method to help out here! By calling the .raku method, we get a Raku code form for almost any object. For instance:

# REPL output
'abc'.raku # "abc"
"abc".raku # "abc"
<a b c>.raku # ("a", "b", "c")

So he just changed his initial assignment line to chain one more method (.raku):

my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3,*-3).raku;
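
A side benefit of .raku is that it also escapes awkward characters, so the generated code stays valid even if the glue text itself contains quotes. The sketch below uses a made-up glue string and an anonymous sub rather than real CLDR data.

```raku
use MONKEY-SEE-NO-EVAL;

my $glue = ' "and" ';    # hypothetical glue text that contains quotes

# .raku yields a quoted, escaped literal that can be pasted into source code.
my $code = 'sub (+@items) { @items.join(' ~ $glue.raku ~ ') }';

my &joiner = EVAL $code;
say joiner <a b>;        # a "and" b
```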

Now his code worked. His last step was to find a way to reuse the generated code so that the initial extra work would pay off. He made a very rudimentary caching setup (rudimentary because it’s not strictly thread-safe, though since values are only added, and are identical whenever they are regenerated, that’s not a huge problem here). This is what he came up with (declarator POD and type information removed):

sub format-list (+@items, :$language = 'en', :$type = 'and', :$length = 'standard') {
state %cache;
my $code = "$language/$type/$length";
# Get a formatter, generating it only if it's not been requested before
my &formatter = %cache{$code}
    //= generate-list-formatter($language, $type, $length);
formatter @items;
}
sub generate-list-formatter($language, $type, $length --> Sub ) {
# Get CLDR information
my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3,*-3).raku;
# Generate code
my $code = q:s:to/FORMATCODE/;
sub format-list(+@items) {
if @items > 2 { @items[0] ~ $start ~ @items[1..*-2].join($middle) ~ $end ~ @items[*-1] }
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items[0] }
else { '' }
}
FORMATCODE
# compile and return
use MONKEY-SEE-NO-EVAL;
EVAL $code;
}
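
If the thread-safety caveat ever became a concern, one possible approach (an assumption here, not what the elf did) would be to guard the cache with a Lock. In this sketch generate-formatter is a hypothetical stand-in for the expensive generator, and it counts how often it runs.

```raku
# Hypothetical stand-in for an expensive generator; counts its invocations.
my $generated = 0;
sub generate-formatter(Str $key) {
    $generated++;
    -> +@items { @items.join(', ') }
}

my %cache;
my $lock = Lock.new;

sub cached-formatter(Str $key) {
    # protect serializes access; //= only runs the generator on a cache miss.
    $lock.protect: { %cache{$key} //= generate-formatter($key) }
}

cached-formatter('en/and/standard') for ^3;
say $generated;    # 1
```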

And there he was! His function was all finished. He wrapped it up into a module and sent it off to the other elves for testing:

format-list <apples bananas kiwis>, :language<en>; # apples, bananas, and kiwis
format-list <apples bananas>, :language<en>, :type<or>; # apples or bananas
format-list <manzanas plátanos>, :language<es>; # manzanas y plátanos
format-list <انارها زردآلو تاریخ>, :language<fa>; # انارها، زردآلو، و تاریخ

Hooray!

Shortly thereafter, though, another elf took up his work and decided to go even crazier! Stay tuned for more of the antics from Santa’s elves and how they took his lists to another level.
