Custom tuning parameters - a dubious feature
Yesterday I wrote about the newest Futhark release, but while I highlighted the highly appreciated contributions of students and external contributors, I intentionally left out a somewhat dubious language feature of my own devising. That, however, is the topic of this post.
Tuning parameters
The code generated by the Futhark compiler makes various dynamic decisions on
how to run the code most efficiently. The fanciest example is incremental
flattening, which, for a given instance of
nested parallelism in the source program, will generate multiple code versions
that produce the same result, but exploit the parallelism differently. As a
simple example, for an expression map f xs where f is a function that
contains its own parallel constructs, it may be best either to launch a thread
per iteration of the map and simply run f sequentially, or to exploit the
parallelism in f as well. The underlying tension is that exploiting
parallelism is a good idea if it allows you to fully exploit the computation
resources available, but once that happens, further parallelism is simply
overhead. At run-time, Futhark decides between different versions by comparing
the amount of parallelism in the outer level (that is, the number of iterations
in the map) with a threshold parameter that is some measure of “how much
parallelism is needed to saturate the GPU”. The optimal value for this threshold
parameter is highly dependent on both the application and hardware, and can only
be found through autotuning (paper on the
technique).
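
To make the choice concrete, here is a rough Futhark-level sketch of the kind of decision the generated code makes; the function names and the threshold plumbing are illustrative inventions of mine, not actual compiler output:

-- A function with inner parallelism, and a sequential variant that computes
-- the same result.
def f_parallel (row: []i64) : i64 =
  i64.sum (map (\x -> x * 2) row)

def f_sequential (row: []i64) : i64 =
  loop acc = 0 for x in row do acc + x * 2

-- The generated code picks between two versions of `map f xss` by comparing
-- the amount of outer parallelism to a tunable threshold.
def map_f [n][m] (threshold: i64) (xss: [n][m]i64) : [n]i64 =
  if n >= threshold
  then map f_sequential xss  -- enough outer parallelism to saturate the GPU
  else map f_parallel xss    -- too little: exploit the parallelism inside f too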
Thresholds are an example of what we call a tuning parameter - a knob you can turn that may have performance consequences, but does not affect the value you get out of running a program. Although thresholds are the most prominent, Futhark exposes various other tuning parameters for controlling the GPU grid size, thread block size, number of CPU threads, chunk sizes, cache sizes, and so on. These all have reasonable defaults set by the runtime system, but can be manually tweaked if needed. Tuning parameters are set at program startup via command line options, or by passing in a tuning file.
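
To give a rough picture of what that looks like in practice (going from memory of the Futhark documentation, so the details may be slightly off), a tuning file - such as one produced by futhark autotune - is just a sequence of name=value assignments, one per line, which a compiled program reads at startup via its --tuning option. The parameter names below are invented for the example:

main.suff_outer_par_0=2097152
main.suff_intra_par_1=32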
But why should code generated by the compiler have all the fun? The Futhark compiler cannot be expected to do everything for you, and people who implement algorithms in Futhark also often end up with parameters, such as block sizes or sampling rates, that have no semantic effect but can influence performance. You can of course expose these knobs as ordinary parameters, but that can easily lead to unwieldy APIs. Since Futhark already had a run-time mechanism for tuning parameters, I reasoned that it might be useful for Futhark programs to be able to define their own, with largely the same usage in mind - a hidden, implicit parameter that can be set at program startup, but otherwise has a reasonable default.
Our design for custom tuning parameters
Although conceptually simple, designing custom tuning parameters turned out to be a bit tricky. Each tuning parameter should have two main properties:
- A name.
- A default value of type i64.
The restriction to i64 is because that is what the existing run-time
infrastructure supports, and I did not want to change it.
There is some interesting prior work here. Chapel, a
PGAS language for HPC, supports a feature called configuration
parameters,
which are global variables that can be overridden using
“implementation-dependent means”, which in practice means command-line options
and configuration files. I generally consider Chapel to be a really good and
inspiring design, which is carefully tailored to its intended HPC domain.
Configuration parameters are a good example: while the mechanism is likely not
flexible enough to be useful for full general-purpose programming, it makes it
very easy
to expose configuration knobs without having to worry about the details of
parsing command line parameters, generating help text, and so on. I was very
tempted to copy the design of Chapel’s configuration parameters entirely, which
consists largely of putting config in front of a global variable definition.
It may still turn out that it was a mistake not to copy Chapel’s design.
For Futhark, I wanted to avoid adding new syntax, perhaps because I am unsure whether this design will persist. I also wanted the default value of a parameter to be able to depend dynamically on input data - and I honestly don’t know whether that is a good idea or not.
My first idea for the design was to add a new intrinsic function such as
param, which would be used in the form param("foo", 123). This would then
return 123 unless the tuning parameter "foo" had been overridden by the
user. Since param would have the type param : []u8 -> i64 -> i64, this
enforces that the tuning parameter is a 64-bit integer. Unfortunately, that
[]u8 in the type is a bit of a problem - Futhark does not have strings, merely
some syntactic sugar for byte arrays, and there is no way to require that the
first argument to param is a string literal. What should happen in a case like
param(e, 123), where e is some complicated expression? We don’t want to
evaluate arbitrary code just to determine the name of the tuning parameter!
Therefore we cannot use an intrinsic function for this.
Instead the design is based on
attributes, through which expressions
can have arbitrary metadata associated with them. This also seems appropriate:
tuning parameters are supposed to be “non-semantic”, in the sense that their
value is not supposed to affect the result of the program, in the same way that
attributes can also be freely ignored from a semantic perspective. Tuning
parameters are exposed via the attribute #[param(param_name)], which, when
attached to an expression of type i64, defines a tuning parameter named
param_name, with the default value taken from that expression. Example:
def block_size : i64 = #[param(BLOCKSIZE)] 32

Since the attribute can be attached to arbitrary expressions, we can generate the default value from properties of the input data. For example, maybe it is useful for the block size to be the square root of the length of some input array.
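As a hypothetical sketch of that idea (the function and parameter names are my own invention, not from any real program):

-- Default the block size to the square root of the input length, while still
-- allowing BLOCKSIZE to be overridden when the program starts up.
def block_size (xs: []f64) : i64 =
  #[param(BLOCKSIZE)] (i64.f64 (f64.sqrt (f64.i64 (length xs))))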
One downside of this design is that there is nothing that prevents you from
putting the param attribute on an expression that is not of type i64. In
principle the type checker could detect this, but for perhaps foolish
ideological reasons, I am reluctant to make the type checker too aware of
attributes. In practice, the compiler silently ignores attributes on expressions
of an invalid type.
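
For example (a contrived case of my own), the attribute below is silently dropped, and no tuning parameter is created, because the expression has type f64 rather than i64:

def sampling_rate : f64 = #[param(RATE)] 0.25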
Another problem is that there is nothing that prevents different libraries from defining tuning parameters of the same name, which will then conflict. In practice, this means that setting one tuning parameter will also set the other, even though they can have different default values. The root problem here is that tuning parameters inhabit a single global namespace, and it is not quite clear to me how that can be resolved. One wrinkle is that Futhark’s ML-style module system allows application of parameterised modules, so a single definition of a tuning parameter in the program text may in fact be duplicated any number of times, and it’s not really clear which of those copies should “win” the name.
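A contrived illustration of the clash (module and parameter names invented for the example):

-- Both definitions claim the name BLOCKSIZE, so overriding it at startup
-- affects both, even though their defaults differ.
module lib_a = { def block : i64 = #[param(BLOCKSIZE)] 32 }
module lib_b = { def block : i64 = #[param(BLOCKSIZE)] 256 }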
One problem, which really isn’t one, is that there is nothing that requires programs to use tuning parameters only for semantically transparent values. You can use the value you get back for whatever you want. Indeed, our tests for this feature depend quite obviously on being able to observe their values. That’s just one of those sharp edges, I suppose.
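For instance, nothing stops a definition like the following (again contrived) from letting the parameter’s value flow straight into the program’s result:

-- The “tuning parameter” FOO is directly observable in the output.
def observable : i64 = 10 + (#[param(FOO)] 0)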
Performance properties
The easiest way to implement tuning parameters is purely dynamically, such that
when we compile an expression #[param(BLOCKSIZE)] e, we emit code that checks
in some context structure whether BLOCKSIZE has been set and, if so, returns
that value, and otherwise evaluates e and returns its value. But tuning parameters
are often used in cases where it would be very useful if the compiler could
know their specific value when generating code, for example to optimise memory
allocation or unroll loops. For Futhark’s GPU backends, this can actually be
done easily enough: Futhark essentially does JIT compilation of GPU
kernels at program startup, and since the
tuning parameters are already set by then, we can just inline them directly
before invoking the kernel compilers. This is a little more tricky for the host
side of the program, since that is C code that is compiled just once.
One solution I have considered is the ability to pass values for tuning parameters to the Futhark compiler itself, where they will then be inlined whenever they are used. The downside of this approach is that it makes the “tuning” part of tuning parameters much more time-consuming, as you will need to completely recompile the program whenever you want to try a new value. Perhaps a mixture would be a good compromise: by default, tuning parameters are a purely run-time mechanism to enable swift iteration, but optionally they can be passed to the compiler and baked in more thoroughly. It may of course be that in some cases the performance advantages only occur when they are passed to the compiler, and so the tuning speed advantage of dynamic tuning parameters is pointless.
Takeaway
Despite not yet having used them for anything real, I remain convinced that custom tuning parameters are a useful feature in principle. I am however unsure that sneaking them in through attributes is the right way - it feels like I am afraid to commit, and I normally detest cowardice in language design. I think we also need to further improve the performance potential of the implementation, but I should probably implement one of those so-called “use cases” first…