Why Is the Futhark Compiler so Bad at Inlining?
Quick answer: because it was not necessary to do better for the research we wanted to do. The remainder of this post is an elaboration on this answer, and how it may change in the future.
Inlining is an optimisation by which a call to a procedure is replaced by the body of the procedure. The advantage is twofold:
1. Elimination of procedure call overhead, usually pushing and popping values from the stack.
2. The procedure body may be optimised in the context in which it occurs. For example, if one of the actual arguments is a constant, inlining may provide opportunities for constant folding (see the sketch below).
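As a minimal sketch of advantage (2), consider a contrived example (the names are made up purely for illustration):

```futhark
-- A tiny polynomial evaluator.
def poly (a: i32) (b: i32) (x: i32) : i32 = a * x * x + b * x

-- After inlining, 'poly 0 1 x' becomes '0 * x * x + 1 * x', which
-- constant folding and simple algebraic simplification can reduce to
-- just 'x'.  Across a call boundary, the constant arguments are of no
-- help.
def main (x: i32) : i32 = poly 0 1 x
```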
Except in the case of very small functions, advantage (2) is the important one. Inlining is a so-called enabling optimisation that does not itself provide much benefit, but can allow other optimisations to apply. Indeed, it is perhaps the most important of all enabling optimisations, as most compilers are not good at optimising across procedure boundaries.
The main disadvantage of inlining is that it makes the program larger. Not only does this have the obvious detrimental effect on compile times, it may also hurt run-time performance, as the increased code size puts pressure on the instruction cache.
Whether or not to inline a function is not at all obvious, and the consequences of doing the wrong thing can be quite substantial, as crucial optimisations fail to apply. In practice, compilers use complicated and opaque heuristics, combined with programmer-provided annotations, to decide when to inline.
When we started the Futhark project, long before it was intended to be useful, we wanted to study two main program transformations: fusion and flattening. Both of these are large-scale restructurings of program control flow, and they cannot easily cross function boundaries. While interprocedural variants might eventually merit study, this was not our initial goal. As a result, we picked a very simple inlining strategy: inline every function application. This allowed us to conduct our research without having to face the thorny problem of when to inline.
Interestingly, because Futhark programs tend to be rather small, this aggressive inlining policy turned out to work fine for most programs. While we did eventually add some attributes for controlling inlining, they were used very rarely. One reason is that they have quite sharp edges: the GPU backends simply do not support calls to certain functions in certain positions. In particular, any function that allocates memory within a GPU kernel must be inlined in order for memory expansion to take place, and since this code generation detail is not exposed in the source language, it is difficult to reason about.
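To make those sharp edges a bit more concrete, here is a made-up example of the problematic case: a function whose body creates arrays and contains parallel operations, applied inside a map. On the GPU backends such an application has to be inlined so that memory expansion can hoist the allocations out of the kernel; forcing it to remain a call (say, with a #[noinline] attribute) is not something the GPU code generators can cope with.

```futhark
-- 'row_sum' creates arrays internally ('iota' and 'map'), so when it
-- is applied inside the outer 'map' below, the call must be inlined
-- before memory expansion and GPU code generation can proceed.
def row_sum (n: i64) (x: i64) : i64 =
  reduce (+) 0 (map (+ x) (iota n))

def main (n: i64) (xs: []i64) : []i64 =
  map (row_sum n) xs
```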
However, inlining everything is clearly not going to cut it as Futhark becomes useful for larger programs. As a result, I spent some time earlier this year on slightly refining our inlining strategy. Let me temper expectations: it is still extremely aggressive. Specifically, we now inline:
- Any function or application specifically marked with the #[inline] attribute (see the example after this list).
- Any function that is only applied once.
- Any application of a function that creates arrays or contains any parallel operations, if that application is itself inside a parallel operation.
- Any function that is used inside an AD operator (because our AD transformation cannot handle applications yet).
- As is tradition, any function that satisfies a strange and opaque heuristic that essentially compares the number of parameters and the size of the return type with the size of the function body.
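To illustrate the first rule, the attribute can be attached either to a definition, in which case every application of the function is inlined, or to a specific application. A rough sketch with made-up functions:

```futhark
-- Marking the definition: every application of 'square' is inlined.
#[inline]
def square (x: f64) : f64 = x * x

-- An otherwise unremarkable function...
def cube (x: f64) : f64 = x * x * x

-- ...where we instead mark a single application for inlining.
def f (x: f64) : f64 =
  let y = #[inline] (cube x)
  in square x + y
```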
At a high level, the functions that make it through the gauntlet without being inlined tend to fall into one of two categories:
1. Large top-level functions that constitute significant subprograms.
2. Scalar leaf functions, typically those that implement some nontrivial (but sequential) formula.
While not inlining the functions in category 1 can sometimes cost us opportunities for fusion, inlining them has a substantial impact on compilation time. The functions in category 2 are usually not subject to the optimisations that the Futhark compiler is particularly good at, so not inlining them is typically fine. Also, since the code generated by Futhark is consumed by downstream optimising compilers, such functions may still end up being inlined by compilers that have a better idea of how to micro-optimise for a specific machine.
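For a sense of what a category 2 function looks like, here is a made-up example: a purely sequential scalar formula that the Futhark compiler gains little from inlining, but which a downstream compiler may well inline and micro-optimise anyway.

```futhark
-- A nontrivial but entirely sequential scalar formula.
def gaussian_pdf (mu: f64) (sigma: f64) (x: f64) : f64 =
  let d = (x - mu) / sigma
  in f64.exp (-(d * d) / 2) / (sigma * f64.sqrt (2 * f64.pi))
```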
While we do not systematically track compile times for the Futhark compiler (yes, yes, I know), I did do some ad-hoc measurements when I refined our inlining strategy. A medium-sized program such as heston32 had its compile time cut in half, with no measurable impact on run-time performance. Generally, most programs saw no significant change in their run-time behaviour.
In the future, I would like to investigate interprocedural fusion. Our fusion algorithm is specified as a graph transformation, and it seems reasonable to treat function application as just another fusible node, with its fusion properties determined by its body, and fusion implicitly performing inlining when appropriate.
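As a sketch of the kind of case this would address (again with made-up code), consider two maps separated by a function call:

```futhark
def normalise (s: f64) (xs: []f64) : []f64 = map (/ s) xs

-- If 'normalise' is inlined, the two maps can fuse into a single one;
-- if it is left as a call, fusion currently stops at the boundary.
-- Treating the application as a fusible node would let the fusion
-- engine look inside 'normalise' itself, inlining only when fusion
-- actually pays off.
def main (s: f64) (xs: []f64) : []f64 =
  map (+ 1) (normalise s xs)
```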