The past and present of Futhark
We recently released Futhark 0.20.1, which introduces the def
keyword for top-level definitions. This means that all known
language design flaws have
now been fixed (as I no longer consider higher-order modules a design
flaw), and hence Futhark is now perfect. So what comes next? Will
the language never change again? Will we go work on some other
project instead? The answer to both questions is “no”, but this seems
like a good opportunity to discuss some of the original motivation
that lead to Futhark, and where we will go from here. I have
previously written a navel-gazing
narrative
about my own involvement, but Futhark’s past and probable future goes
beyond just me. Futhark is after all developed by a team with diverse
interests and motivations.
The genesis of Futhark came about when Cosmin Oancea was hired by the HIPERFIT project. HIPERFIT was a research project that investigated the use of high-performance functional programming to solve financial problems. As a postdoc, Cosmin had done a lot of work on automatic and speculative parallelisation of Fortran programs at Texas A&M and Cambridge, for example on the Polaris compiler. The compiler research community is full of ideas that have not been tried much in practice, because the time it takes to implement an idea can be very high, due to the friction involved in working within the framework of an existing compiler and language. Cosmin conceived of a tiny functional language, really not much more than a compiler IR, on which certain experimental compiler optimisations would be reasonably easy to perform. That the language was given a syntax at all was mostly a convenience for writing benchmark programs.
Over the years, that tiny language eventually became Futhark, which also grew to be much more than just an IR. Now it’s even downright pleasant to use! This transformation occurred because language design turned out to be quite fun. We truly did not want to design a new language, and even spent some time initially trying to find existing languages that we could use as a frontend to the compiler. But from time to time we added minor features to the language to ease the job of writing benchmark programs, and sometimes we realised that there was even interesting research problems when it comes to designing a language for efficient execution. Martin Elsman contributed significantly to adding features such as polymorphism, modules, and higher-order functions, with Anders Kiel Hovgaard contributing substantially to the latter.
While the original idea that the language isn’t the point is not strictly true anymore, we really are not trying to do any grand language design experiments. There are lots of very nice languages out there - Haskell is one of my own favourites - and they keep improving. Futhark isn’t as nice or as flexible as the best of them, and likely won’t ever be. But Futhark is hopefully faster. Our research challenge, from a language perspective, is to identify those language features that can help the compiler (or at least not hinder it), and are useful for the kinds of nested parallel number-crunching that Futhark is meant for. We also want the language to stay simple - my own yardstick is that Futhark should in aggregate be no more complicated than Standard ML.
So while the idea that Futhark is now perfect is of course facetious, it is indeed the case that significant innovation in language design is beyond the original scope of the project. The future of Futhark does involve further language improvements, but mostly more work - and research! - on the compiler side of things. For the rest of this post, I’ll cover some of the things I hope we’ll be working on in 2022.
Broadcasting
I’ve heard it said that “array programming means not having to write
map
”. Well, in Futhark you do need to write map
, and while it can
sometimes lead to useful clarity and flexibility, you end up writing
map
a lot. In particular, when transcribing linear algebra
formulae, the map
s become quite noisy. Recently I found myself
writing map2 (map2 (/)) x y
to element-wise divide two matrices,
which mathematically we would write as just x / y
.
Most statically typed array languages support this by having a notion of rank polymorphism in their type system. Single-Assigment C (SAC) is an example of this. In SAC, the division function would be defined as accepting arrays of any rank and shape (even different ones), with rules for how a value is then “broadcasted” across another value of higher rank. This allows us to mix arrays of different rank, such as when multiplying a vector with a scalar. In such a system, scalars are just arrays of rank zero. The venerable APL was probably the first language with such rules, and they have since become very common through libraries such as NumPy.
Rank polymorphism is a big hammer that interacts in complex ways with
parametric polymorphism and higher-order functions. I would very much
like to do a proper analysis on the pros and cons of the different
approaches one day, but I think Futhark irrevocably went down a
different path on this one. We don’t want rank polymorphic function
types, but we also don’t want to write map
quite as much. Is there
a middle ground?
Well, that is possibly a research question. We believe that we can
come up with a system where the compiler automatically inserts map
and replicate
operations “as necessary” to make the program
type-correct. For example, consider the case of dividing two matrices
element-wise. We’ll use a division function instead of an operator,
just to make the syntax simpler:
val div : f32 -> f32 -> f32
Suppose we have arrays xss : [n][m]f32
, yss : [n][m]f32
. The
expression div xss yss
is ill-typed, but it becomes well-typed
simply by wrapping div
in two map2
s:
map2 (map2 div) xss yss
Intuitively, this wrapping “lifts” a function to operate on values of
higher rank. Now suppose we have a scalar y : f32
and we wrote div xss y
. Again, this is ill-typed. But we can make it well-typed by
inserting map2
s to lift div
and some replicate
s to add extra
dimensions to y
:
map2 (map2 div) xss (replicate n (replicate m y))
The type of div
is still the simple f32 -> f32 -> f32
that
operates on scalars. Instead of supporting rank polymorphism in
functions, we support it for function applications. This is
certainly weaker than full rank polymorphism, but it is probably
powerful enough to remove the noise of map
s that plague many Futhark
programs. What I like is also that it’s ultimately just a shorthand -
you can write out the map
s and replicate
s yourself if you are
uncertain about what is going on, and the compiler can quite easily
explain what it has inferred, in terms of the same language that the
programmer is using, and without adding any new concepts to the type
system.
Questions left to answer:
It is sometimes ambiguous where and how to insert the
map
andreplicate
s. For example,id xss
,map id xss
, andmap (map id) xss
are all well-typed. Is a rule such as “always insert the minimal number ofmap
s” enough to disambiguate?What happens when higher-order functions become involved? How do we see that
yss |> div xss
should be turned intoyss |> map2 div xss
when the actual application is hidden in the definition of|>
?Which other languages have done similar things?
Automatic differentiation
Automatic differentiation (AD) is a program transformation that is used to compute the derivatives of an arbitrary function. We can view AD as a generalisation of the symbolic differentiation rules taught in high school calculus, extended to also work for programs with control flow, state, storage, etc. Computing the derivative of a function is a useful building block for function optimisation, where we are trying to find an x such that f(x) is minimised or maximised. For example, modern machine learning makes heavy use of gradient descent as part of backpropagation. While AD is not a new idea, the machine learning renaissance has caused a substantial growth in interest, as better AD techniques let us train more sophisticated models.
We (the Futhark developers, on Cosmin Oancea’s initiative) have spent a lot of 2021 implementing a quite effective AD transformation in the Futhark compiler. I expect to write a lot more about it once we make it available in an actual release, so I will not go into a lot of detail about how it works and what it’s useful for. What is interesting is that, so far, AD has not required any language extensions. Two higher-order functions serve as the API:
val jvp 'a 'b : (f: a -> b) -> (x: a) -> (x': a) -> b
val vjp 'a 'b : (f: a -> b) -> (x: a) -> (y': b) -> a
This is the “ideal interface” of Barak Pearlmutter and it is surprisingly flexible. To the compiler, these are of course deeply magical, and trigger AD transformation of the provided function. While a language such as Haskell is powerful enough to implement AD as a library, Futhark requires it to be baked into the compiler. One question is how Futhark’s AD support will fare compared to languages such as Dex. Futhark was co-designed with an optimising compiler, and so optimises well, but AD was never part of its original design. In contrast, Dex is designed by AD experts and specifically with AD in mind. It will be interesting to see whether we will ultimately be hindered by our language design. But so far, it actually works surprisingly well, and our embryonic AD benchmark suite shows both excellent run-time performance and a rather pleasant programming experience.
And honestly, if people managed to write useful AD transformations for Fortran, surely we can cobble something together for Futhark.
Accumulators
Futhark is a pure language, so side effects are tightly controlled:
there are none. (Except non-termination, if you are a purist.)
Things that would normally be done with side effects are instead
handled through other language constructs. For example, an “in-place
update”, where we want to change the value of one element of an array,
is written as xs with [i] = v
. Semantically, this gives you a
copy of xs
with a new value at index i
. Operationally, this is
implemented by an immediate (and cheap!) write to whatever memory is
actually storing xs
. To ensure this imperative write cannot be
observed as an effect, xs
and its aliases may not be used on any
execution path following the with
-expression. The type checker
enforces this via a combination of alias analysis and uniqueness
types.
After an expression xs with [i] = v
, we say that xs
has been
“consumed”. One important safety principle is that a value is only
consumed once. But consider an expression such as the following:
map (\y -> xs with [0] = y) ys
Since the function we pass to map
will be executed once per element
of ys
, we will end up consuming xs
multiple times. Not good.
When defining a function, even an anonymous one, we have no idea how
often it will be invoked. The solution is a rule, enforced by the
type checker, stating that we cannot consume any variable not bound
within the “current function”, since that might otherwise result in
the variable being consumed multiple times. This rigid rule has not
really been a problem so far, but it may be about to become so.
Because with
-expressions don’t mesh well with higher-order
functions, Futhark provides the scatter
function for performing
parallel updates. For example, scatter xs is ys
returns the array
xs
modified such that xs[is[i]] = ys[i]
for every pair
(is[i],ys[i])
, and treats the old xs
as consumed. (Incidentally,
I always felt that scatter
should be a syntactic extension of
with
-expressions rather than a function.)
The main restriction of scatter
is that you can only do one “update”
per input element. For some irregular algorithms, the amount of
updates might be a dynamic data-dependent quantity. To address this,
we would like to introduce the notion of an “accumulator”:
type acc 't
A value of type acc t
represents a “write-only view” of an array
with element type t
. Similar to how a with
-expression can update
an array, we provide a function for updating an accumulator:
val upd 't : *acc t -> i64 -> t -> acc t
The *
before the accumulator parameter means that the argument is
consumed and may not be used again. Instead a new accumulator is
returned. By requiring such a linear use of accumulators (in the
linear
types
sense) we ensure that data dependencies are clear. By making
accumulators write-only, we allow multiple “threads” to operate on
them concurrently without opening the door to nondeterminism.
Accumulators are brought into the world by a higher-order function:
val with_acc [n] 't :
(dest: *[n]t) -> (f: *acc t -> acc t) -> [n]t
Intuitively, with_acc
temporarily turns an array dest
into an
accumulator, passes it to the provided function which returns a
modified accumulator, and then turns the accumulator back into an
array, which will reflect the updates performed.
This leaves a lot of questions still to answer:
How do we allow parallel updates? We cannot consume an outer accumulator from within a lambda we pass to
map
, due to the safety rule mentioned above. Do we justreplicate
accumulators andmap
them? Then how do we turn the resulting array of accumulators back into a single one?What are the semantics for accumulators? If we require that they are used in an exclusively linear manner, then they can be modeled as lists of index-value pairs; recording updates that are then applied to an array at the end.
If we allow multiple accumulators to exist concurrently for different arrays with the same element type, how do we check that the function passed to
with_acc
returns the right one?
Futhark does not have full linear types, but our uniqueness types can
model some of it, if we are careful about which functions we expose.
But we do lose flexibility, as it means only a single accumulator can
usefully exist at once - the others will become “unconsumable” in
nested uses of with_acc
. It might be nice to have a limited form of
linearity for higher-order function types, such that we can
differentiate between with_acc
which applies its function once
(and so in principle needs not be restricted with respect to
consumption), and map
which applies its function many times.
Accumulators are already fully implemented in the Futhark core language, as they are a crucial building block for the AD transformation. This was a much easier problem to solve, as the core language is monomorphic and first-order, except for certain built-in second-order constructs that have known semantics. It’s really quite a challenge to figure out how to expose this power to the source language in a safe and sound manner.
Of course, one might argue that the only reason we have such trouble
is because Futhark is too stubbornly pure.
Dex solves this though
the use of an effect type system, where their main looping construct
(for
) is polymorphic in the underlying effects, and all this
business with accumulators is expressed as just an accumulation
effect. I don’t think retrofitting a full effect system is a good
idea for Futhark, however, as it’s too much of a divergence from the
original design philosophy. I don’t mind that languages aren’t good
at everything as long as they are really good at something.
Full flattening
One of Futhark’s original research challenges was how to efficiently map nested parallelism to GPUs and similar restricted processors. This problem had already been solved in principle by NESL and in practice by CuNESL. These solutions employ the very general full flattening algorithm, which always works but often produces somewhat inefficient code, in particular with respect to space usage and locality. For Futhark, we focused on a subset of parallelism, namely nested regular parallelism, for which we believed a less general technique could produce much more efficient code. This worked, and handles a useful class of programs, and any instances of irregular nested parallelism would usually result in either sequentialisation or a compiler error message.
The later work of Duc Minh Tran gave us a multicore backend that can handle any nested parallelism, but it always bothered me that there are valid Futhark programs that we cannot compile for GPU execution. I plan to take a serious look at implementing full flattening in the Futhark compiler, as part of the multi-versioned code generation scheme used by our incremental flattening algorithm. First I just want to get it to work, but I also have some ideas for (hopefully common) special cases that can be handled more efficiently - in particular for the (common?) case where the outer levels of parallelism are regular, and only an inner one is irregular.
I long ago decided that the criteria for releasing version 1.0 of the compiler would be the ability to compile any valid program to GPU code, and this is the last piece missing. Hopefully 2022 will see that happen.