Fork me on GitHub

Futhark 0.13.1 released

Posted on November 27, 2019 by Troels Henriksen

Today I released version 0.13.1 of the Futhark compiler. The true motivation was the fix of a few bugs that I worried would be encountered by the students currently taking our course on Parallel Functional Programming at DIKU. Since the last release announcement there has actually been two unannounced releases - 0.12.2 and 0.12.3 - but they had so few interesting user-facing changes that I didn’t feel like writing announcements for them.

Since the new release is called 0.13.1 and not 0.12.4, there has been a breaking change. Fortunately, it’s a very trivial one that will not affect most users: when writing test cases for futhark test, you no longer write empty(t) for an empty array of element type t, instead you write empty([0]t). The motivation for this change is that the work on size types requires a more principled handling of empty arrays. In particular, even though one dimension of an array is empty, that does not mean all dimensions have to be empty. [0][2]i32 is a perfectly valid array type, and should be distinct from [2][0]i32.

Now, I did also add one interesting optimisation related to sum types. It turned out to have pretty small impact on most programs, but I still think it’s worth talking about, because I was unable to find anything like it in the literature. I was also unable to come up with a good name for it, so it has a pretty stupid one.

Load Sinking

Recall how sum types are represented at run-time in Futhark: the constructor is turned into an 8-bit integer (i8), and the payload types are just concatenated. For example, the type

type sum = #foo i32 | #bar bool

will be represented as (i8, i32, bool). A pattern match

match v
case #foo x -> f x
case #bar y -> g y

is then compiled into if-then-else:

let (c, x, y) = v
in if c == 0
   then f x
   else g y

So far, so good. Now, what happens if we have an array of sum types? Well, that will turn into an array of tuples, and in Futhark, an array-of-tuples are represented as a tuple-of-arrays for low-level performance reasons. Continuing the example, an array []sum will be represented as ([]u8, []i32, []bool). Let’s say that the original source program has such an array, called sums, and is indexing that array and performing a pattern match on the result:

let v = sums[i]
in match v
   case #foo x -> f x
   case #bar y -> g y

After turning sums into tuples and arrays-of-tuples into tuples-of-arrays, we obtain the following:

let c = sums_cs[i]
let x = sums_xs[i]
let y = sums_ys[i]
in if c == 0
   then f x
   else g y

So, what’s the problem here? We are always reading both x and y, even though only one of them will ever be used! That means we have redundant accesses to memory, and memory is slow. It can become a significant problem when we have larger sum types with many constructors and large payloads, even taking into account the deduplication we do when constructors overlap in their payload type. In particular, this pattern is almost guaranteed to occur when maping across an array of sum types.

The solution I came up with is an optimisation I call “sinking” because it is the opposite of hoisting. Basically, for each instance of array indexing in the program, I check whether the result of the indexing is only used inside a later occurring if branch. If so, I move the index operation inside the branch. Applied to the program above, it produces:

let c = sums_cs[i]
in if c == 0
   then let x = sums_xs[i]
        in f x
   else let y = sums_ys[i]
        in g y

The program that motivated this optimisation was a ray tracer I worked on that contains arrays of relatively large sum types (the objects in the scene) that each desugar to over a dozen values. Unfortunately, while the optimisation had impact on contrived benchmarks, the ray tracer was mostly unaffected. I suspect two reasons:

  1. Our deduplication is quite good, so most values (e.g. the bounding box and material of the object) is used in all branches of the case.
  2. The ray tracer performs so many other memory accesses anyway, and is bottlenecked by other things.

Oh well. Some day, someone may write a program that would have been slightly slower if not for this optimisation.