In-place mapping and the pleasure of beautiful code nobody will ever see

Posted on December 6, 2022 by Troels Henriksen

Last month Philip wrote about our fancy new short-circuiting optimisation, which analyses the program and modifies it such that values are constructed in-place. The intent is to avoid the copies that usually result from functional programming, where a value is assembled in memory over there, and then afterwards copied to where it is needed over here. Just consider two separate expressions that both produce an array each, which are then concatenated. Short-circuiting modifies those two expressions such that the arrays are produced adjacent to each other in memory, meaning the concatenation becomes a no-op.

Short-circuiting is a very general technique, and the concatenation example is just one thing it can do. The original inspiration for this line of work came from Cosmin Oancea, the spiritual father of the Futhark project as a whole, who really wanted to have certain guarantees of the code that would be generated from a Futhark program. For example, for simple expressions like map f xs, he desired this map to be operationally in-place, meaning the result array should be stored in the same memory as that occupied by xs. This is possible when two things hold:

xs is not used afterwards (nor anything aliased with xs).
The type of the result is the same as the type of xs (put another way, f must have type t -> t for some t).

A compiler could check for this and make the map in-place when possible. Eventually this line of thinking grew into what became short-circuiting, which handles much more fancy cases (read the blog post or the paper). However, while attending SC22, where this paper was presented, I discovered that along the long and difficult journey to publishable results, we had actually forgotten our original simple motivating example: maps were not performed in-place!

Fortunately, Philip was able to quickly address the omission and support this optimisation within the framework of short-circuiting. It was nice to see that the fundamental approach was flexible enough to also cover this case, even though we had forgotten all about it.

But what is the impact of this supposed optimisation in practice? Consider a function like the following:

def main (xs: *[]i32) = map (+1) xs

As the parameter xs is marked unique, the function has ownership of it. This means that the map can be performed in-place, and indeed it is. This does not affect the total memory traffic: we still have to read and write one i32 for every array element. The only advantage of operating in-place is that the memory footprint is cut in half, as we do not need to allocate memory to store the result. Lowering memory usage can be a goal in itself, but it does not always significantly affect performance. But let’s see!

On an input array with 100 million elements, on an A100 GPU, the in-place map runs in 798μs, while the out-of-place version runs in 811μs. Essentially nil impact. However, if we ask Futhark to generate parallel CPU code, the in-place map runs in 7152μs and the out-of-place map in 15839μs. That means the in-place map is about 2x faster! The reason for this dramatic speedup is that when the memory footprint is lowered, we can fit more data in the CPU cache. While GPUs have caches as well, they are less important, and instead depend on latency hiding and high-throughput memory buses to obtain performance.

This is of course an absolutely best-case scenario, where the footprint is cut in half. The impact on real programs is smaller, although still noticeable. But the real reason I enjoy this optimisation is harder to quantify: it is aesthetic rather than numeric. I take pleasure in looking at the generated code and seeing this:

for (int64_t SegMap_i_6133 = start_6130; SegMap_i_6133 < start_6130 + n_6132; SegMap_i_6133++) {
  int64_t slice_6134 = dz2080U_6078;
  int64_t gtid_6120 = SegMap_i_6133;
  int64_t remnant_6135 = SegMap_i_6133 - gtid_6120;
  int32_t x_6106 = ((int32_t *) mem_6124.mem)[gtid_6120];
  int32_t defunc_0_f_res_6107 = add32(1, x_6106);
  ((int32_t *) mem_6124.mem)[gtid_6120] = defunc_0_f_res_6107;
}

Looking past the intermediate bindings and the compiler-chosen names, this just looks right. This is how I myself would write this program in C: read an element of the array, do something to it, and put it back where I got it from.

My colleague Martin Elsman maintains the MLKit compiler. He recently made some improvements to code generation that made it produce simpler assembly - things like avoiding superfluous moves and jumps-to-jumps. As expected, this had no performance impact on modern speculating, superscalar processors, but Martin expressed significant pleasure at reading the now-cleaner emitted code.

I wonder how many compiler developers feel this way. We often say that code is “written for humans to read” and therefore should be readable, but the code produced by a compiler is usually destined for nothing but execution. Still, I like knowing that the unseen transient code produced by the compiler contains a spark of elegance and simplicity, even if it is sometimes well hidden and no user will ever notice.