
Futhark 0.21.13 released

Posted on July 7, 2022 by Troels Henriksen

Mighty software isn’t released; rather, it escapes, and so Futhark 0.21.13 just did (full changelog here). In contrast to most of the releases since the last one I bothered to write a blog post about, this one has important user-visible improvements. I will merely summarise most of the changes since the last announcement, either because they are small or because I have already written about them, but one of them (an entirely new backend for vectorised CPU code) deserves a longer exposition.

The summarised changes are as follows.

A new backend

This Futhark release comes with a new parallel compiler backend courtesy of the students Pema Malling, Louis Marott Normann, Oliver Petersen, and Kristoffer Kortbæk, who implemented it for their bachelor’s thesis. (See a pattern? Also, thesis here.) This backend is an extension of the multicore CPU backend that Minh Duc Tran implemented (as his master’s thesis, as you might by now expect) to make better use of the SIMD instructions that are supported by modern CPUs.

To understand how the new backend works, first let’s recap how the existing multicore backend works in simplified terms. Ultimately the question is how to execute in parallel a small collection of built-in parallel primitives, roughly corresponding to generalised variants of the source-language functions map, reduce, scan, and a few others. Consider map as the simplest case. Every element is independent of the others, so in principle you could start a new thread for each one in order to maximise parallelism. In practice, the overhead of such maximal parallelism is not worth it. Your CPU does not have a million cores - it has perhaps 16. So instead, we split the array being mapped into a number of chunks, assign a chunk to each CPU core, and then have each thread perform a sequential map on its chunk.
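To make the chunking concrete, here is a minimal C sketch of the idea using pthreads. This is not the code the compiler actually generates - the fixed thread count and names such as map_chunk are invented for illustration.

    /* Chunked parallel map: split the array into one chunk per
       thread, then run a plain sequential map inside each chunk.
       Compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 1000000

    static float in[N], out[N];

    struct chunk { int start, end; };

    /* Each thread performs a sequential map over its own chunk. */
    static void *map_chunk(void *arg) {
      struct chunk *c = arg;
      for (int i = c->start; i < c->end; i++)
        out[i] = in[i] * 2.0f + 1.0f;  /* the mapped function */
      return NULL;
    }

    int main(void) {
      pthread_t threads[NUM_THREADS];
      struct chunk chunks[NUM_THREADS];
      for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].start = N / NUM_THREADS * t;
        chunks[t].end = t == NUM_THREADS-1 ? N : N / NUM_THREADS * (t+1);
        pthread_create(&threads[t], NULL, map_chunk, &chunks[t]);
      }
      for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
      printf("%f\n", out[N-1]);
      return 0;
    }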

Even though we have now decided not to exploit thread parallelism inside each chunk, we are semantically still dealing with a map, and each iteration is still independent. Modern CPUs have SIMD instructions that provide data parallelism - performing the same operation on different data - which we can use to execute the otherwise sequential maps. The performance advantage of using SIMD instructions is very context-dependent, but at a high level, it seems clearly better to have one instruction working on multiple pieces of data at once. In principle, a SIMD addition instruction operating on vectors of four numbers each should be four times faster than executing four distinct addition instructions.
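As a purely illustrative rendering of the four-numbers-at-once example, here is the SSE intrinsic _mm_add_ps adding four floats with a single instruction - the code our backend generates does not literally look like this:

    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    int main(void) {
      float a[4] = {1, 2, 3, 4};
      float b[4] = {10, 20, 30, 40};
      float c[4];

      /* One SIMD addition over all four lanes at once... */
      __m128 va = _mm_loadu_ps(a);
      __m128 vb = _mm_loadu_ps(b);
      _mm_storeu_ps(c, _mm_add_ps(va, vb));

      /* ...where the scalar equivalent is four separate additions. */
      for (int i = 0; i < 4; i++)
        printf("%.0f ", c[i]);  /* prints: 11 22 33 44 */
      printf("\n");
      return 0;
    }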

In principle, a C compiler could automatically vectorise the loops produced by the existing multicore backend and make them use SIMD instructions. In practice, auto-vectorisation is quite fragile, so it doesn’t happen except in very simple cases. Ironically for a compiler writer, I’m not a big fan of black-box compiler optimisations, especially when they are fragile. Thus the core idea of the new backend: instead of processing each chunk sequentially within a thread, use SIMD instructions to further squeeze out a bit of performance.
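To get a feel for what “fragile” means, consider these two loops. Which loops a given C compiler actually vectorises depends on its version and flags, so this is illustrative rather than exhaustive:

    #include <stdio.h>

    /* Typically auto-vectorised: unit stride, independent
       iterations, and 'restrict' rules out aliasing. */
    void easy(float *restrict out, const float *restrict in, int n) {
      for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f + 1.0f;
    }

    /* Typically not auto-vectorised: the loop-carried dependence on
       acc makes the iterations sequential as written, and because
       floating-point arithmetic is not associative, the compiler
       will not reassociate it without flags like -ffast-math. */
    float hard(const float *in, int n) {
      float acc = 0.0f;
      for (int i = 0; i < n; i++)
        acc = acc * in[i] + 1.0f;
      return acc;
    }

    int main(void) {
      float xs[8] = {1, 2, 3, 4, 5, 6, 7, 8}, ys[8];
      easy(ys, xs, 8);
      printf("%f %f\n", ys[0], hard(xs, 8));
      return 0;
    }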

Another person who doesn’t much like auto-vectorisation is Matt Pharr. In fact, he found auto-vectorisation so fragile that he invented a new programming language, ISPC, in order to make vectorisation more reliable. ISPC is a low-level data-parallel language like CUDA, in which multiple program instances are active simultaneously, and in which vectorisation cannot fail: it is a property of the programming model, rather than a black-box compiler optimisation. The new Futhark backend, tellingly named ispc, generates ISPC code for the per-chunk loops (the sequential maps). This makes it the responsibility of the ISPC compiler to select the right SIMD instructions for a given CPU - the kind of thing we really don’t have the resources to maintain ourselves. One disadvantage is that ISPC isn’t really intended as a compilation target, but students fortunately have a high pain tolerance, so they managed to make it work - even for very exotic stuff, such as dynamic bounds checking.
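In C terms, ISPC’s execution model corresponds roughly to the sketch below. The names and gang size are hypothetical, and real ISPC compiles each gang to a single stream of SIMD instructions rather than an inner loop - the point is that the lanes are explicit in the model, so vectorisation cannot silently fail:

    #define GANG_SIZE 8  /* e.g. 8 float lanes on an AVX2 machine */

    /* A data-parallel loop applies its body to a whole gang of
       program instances at a time; in ISPC, every operation in the
       body covers all GANG_SIZE lanes at once. */
    void foreach_double(float *out, const float *in, int n) {
      int i = 0;
      for (; i + GANG_SIZE <= n; i += GANG_SIZE)  /* full gangs */
        for (int lane = 0; lane < GANG_SIZE; lane++)
          out[i + lane] = in[i + lane] * 2.0f;
      for (; i < n; i++)  /* ragged tail: surplus lanes masked off */
        out[i] = in[i] * 2.0f;
    }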

So what is performance like? It’s a bit unpredictable. ISPC is very sensitive to static memory access patterns and control flow, and the students focused solely on the backend itself. The input code it compiles is produced by the existing multicore pipeline and is not at all optimised for SIMD execution in terms of memory layout and nested parallelism. On mandelbrot I get about a 1.67x speedup on my laptop with a Ryzen 4750 CPU compared to the multicore backend. Not much, but it’s free performance - all it requires is passing --backend=ispc to the Futhark compiler.

On crystal I get a speedup of 6.78x, which is far more than I would expect. The probable cause is that ISPC uses CPU intrinsics for trigonometric operations such as sines and cosines, while a C compiler will normally make a function call to the math library.

On Barnes-Hut n-body, the ispc backend runs at 0.62x the speed of the multicore backend - a slowdown. This is because the core of this program is an irregular romp through a tree structure, which is poorly suited to SIMD execution - or at least to the kind of SIMD code we currently generate.

We hope to improve these results by adding dedicated optimisation passes to the compilation pipeline used for the ispc backend. If you’re a student, feel free to hit us up!

The fruits of forced labour

Exploiting students as a source of free labour has long been part of the Futhark development process. This is not unusual in academia in general, and jokes about dusty decks developed by generations of students are common. The punchline tends to be about the code quality of these systems, which are often monstrosities grown by accretion, as each contributor added whatever they required for their immediate needs without any consideration for the overall design. Many research compilers are exactly like this, and ever since we started development of Futhark, I have been terrified that we would end up the same way.

While Futhark has received many contributions from students, I think we have managed to avoid these problems. The overall code quality of the compiler is still pretty high, judging by the ease with which we can still make changes and by the relatively low bug density. Further, the bugs we do have tend to be localised logic errors, rather than unintended interactions between distinct components. I have a few guesses as to how we managed this: