The why and how of parallel in-place random-access accumulation
The Futhark intermediate representation (IR), used by the compiler, is very similar to the source language used by programmers. While the IR lacks syntactic sugar and facilities like polymorphism and higher order functions, the language constructs are fundamentally the same - with one exception, which is the topic of this post. It turned out that when implementing automatic differentiation a couple of years ago, we needed to efficiently express certain patterns of computation that were simply beyond our available programming vocabulary. When we finally came up with an extension that had the properties we needed, it is still not clear how we could expose it in a safe manner in the source language, and so it remains an IR plaything. In this post I will describe the need for this extension, how we designed it for the IR, and why it is still not available for use in the source language. Bottom line up front: the idea is to encode certain effects using linear values.
Basics of automatic differentiation
If you do not care about the background of the problem, then feel free to skip this section.
Automatic differentiation (AD) is a program transformation for computing mathematical derivatives of computer programs that implement mathematical functions. As a Haskell programmer, AD reminds me of monads in that it is notoriously difficult to explain, and once you understand it, you lose the ability to explain it to others. Every paper on AD seems to start with an incomprehensible introduction - feel free to peruse the one in the paper we wrote on AD in Futhark. I really need to write a proper post about AD from a Futhark perspective at some point, but it is not necessary to have an understanding of AD to understand this post. If you are truly curious about how AD works, then this paper is in my opinion the best introduction.
AD can be used for many things (I use it for surface normals here), but one common use is in function optimisation. A cost function is a function that takes as input some parameters and returns a single number, denoting the “error” of the function relative to some goal. The problem is then tweaking the parameters to make the return value as small as possible. This is basically how machine learning works. If we can find the gradient of the function, then we can nudge the parameters along the direction of the gradient, and gradually decrease the error.
AD lets us easily compute the gradient of functions that are too awkward to differentiate by hand. The basic idea is that at the bottommost level, the execution of a function for some input consists of a dataflow graph where each node represents some primitive operation on numbers, such as an addition, multiplication, square root, etc. In this view, control flow and memory operations have been “unrolled” - we focus on a concrete execution for fixed input; essentially a recording of the low-level arithmetic operations. To take the derivative of such a graph, we can apply the chain rule and the partial derivatives of all primitive operations and simply propagate information through the graph. Some implementations of AD, such as CoDiPack actually work by recording such a graph during runtime and differentiating it afterwards, while others (such as the one in Futhark) implement AD as a program transformation.
There are two main ways to perform automatic differentiation. The simplest is called forward mode, and works by pairing up each normal input with a tangent - essentially, we turn all numbers into pairs of numbers. Each primitive operation of numbers then becomes an operation on pairs of numbers. Using the chain rule and the partial derivatives of all primitive operations in our language, we can then compute a partial derivative for each input to the function. The forward mode of AD is very simple to implement, and can in many languages simply be done as a library. However, the downside of the forward mode of AD is that if we have n inputs, then we also need to do n passes over the program - this is not efficient for cost functions, which often have an enormous number of inputs but only a single output.
While the forward mode of AD works by attaching extra information (“tangents”) to the inputs and then propagating it down through the dataflow graph, the reverse mode attaches extra information (called “sensitivity”, “cotangents”, or “adjoint”) to the results, and then passing it backwards through the dataflow graph, again by using the chain rule. If we actually have a dataflow graph representing the function execution, then this is not so difficult to do, but if we want to implement it as a program transformation, then it becomes quite mind-bending: we essentially have to construct a differentiated program that runs the original (“primal”) program in reverse, where all reads of a variable become a write.
Describing this process in full detail is far beyond the post, but the idea is
that each scalar variable x in the original program becomes associated with
cotangent variable x_ct, and whenever the original program had an operation
such as
z = x * y
then the differentiated program will contain statements
x_ct += y * z_ct
y_ct += x * z_ct
where we propagate the accumulated cotangent of the z variable into the
cotangents of x and y. The important point is this: Every read of x in
the original program becomes an += write in the derivative.
The notation used above is imperative, and += does not make much sense in a
purely functional language such as Futhark. However, it is not difficult to
imitate += by simply re-binding a variable to a new name - that is how we
always simulate state. We have to be careful when this happens inside of control
flow such as if or loop but it mostly becomes a book-keeping challenge about
keeping track of which variable represents the most updated cotangent, and not
something that is truly difficult.
But consider now this Futhark expression fragment:
let ys = map (\i -> xs[i]) isHere we are using an array of indexes, is, to access another array, xs. This
means that the data dependencies are dynamic - they depend on the concrete
values of is. As with scalars, the cotangents of an array xs is represented
by an array xs_ct, and similarly for ys/ys_ct. We need to update the
elements of xs_ct based on the values of ys_ct, but also while taking into
account which elements of xs contributed to each value of ys. Supposing is
is of length n, then in imperative notation it is not so difficult to
express this:
for j < n:
xs_ct[is[j]] += ys_ct[j]
But in a parallel and purely functional setting, this seemingly simple program
is not easy to express at all. At the machine level, the += accumulation
itself needs some care if we want to run the loop in parallel, but it can be
implemented using atomic updates or some other synchronisation mechanism. The
real problem is that this loop, where we do a write to some arbitrary element
of an array is not easily expressible using our existing data-parallel
vocabulary. For this simple case, where each iteration accesses only a single
element, we can actually express the differentiated program using generalised
histograms,
but this is not possible when the mapped function contains a loop and indexes
xs a dynamic number of times. For simplicity of exposition, I will stick to
the simpler case, even though it is actually a special case that does not need
any fancy techniques.
The root problem is that we are updating an array that originates from outside
the mapped function - it is a free variable. One solution, which you will find
in some implementations of AD, is to simply eliminate the free variables and
rewrite the original program to look like this:
let xs_copies = replicate n xs
let ys = map2 (\xs_copy i -> xs_copy[i]) xs_copies isHowever, the problem is that the modified program is asymptotically less
efficient than the original - we are replicating the xs array n times (where
n i the size of is). We can perhaps avoid actually manifesting the
replicated array by using a clever representation, but differentiating the
program above will produce something like this:
let xs_copies_ct = map2 (\i y_t -> replicate m 0 with [i] = y_t)
is
ys_ct
let xs_ct = map (reduce (+) 0) (transpose xs_copies_ct)This works. However, it works by producing a two dimensional array
xs_copies_ct with rows that each contain a single nonzero value, which we add
together to produce the final cotangent array xs_ct. This is again
asymptotically inefficient because we spend a lot of time adding together
zeroes. This can be avoided if we represent the xs_copies_ct array using some
sparse representation and a more clever way of summation - but the Futhark IR
does not have a datatype for sparse arrays, and it becomes tricky to detect
and exploit the sparsity once we do not statically know how often the xs array
is indexed inside the map.
Language approaches
To restate the problem, our goal is that we want to be able to express computation equivalent to this imperative pseudocode:
for x in xs:
# ... arbitrary control flow, maybe nested loop ...
i = ... # something complicated
v = ... # something complicated
ys[i] += v # where ys is bound outside the loop
Where we also want the outermost loop to be expressed via data-parallel language constructs, as is always the case in Futhark’s IR, and naturally not sacrifice our purely functional semantics.
The issue is basically that we want to perform a side effect. Dex, a functional language with many AD-related innovations, had the same problem and addressed it in quite a natural way: by adding an effect system. Using an effect system, a program can perform side effects (such as writing to the element of an array), but the type system makes precise when it happens. Further, the effects are controlled: while the Dex accumulation effect allows writes into an array, that array cannot be read from while being accumulated into. Assuming the accumulation effect is associative and commutative (which is the case for addition), then the updates can be done in parallel without any risk of race conditions.
I have a hunch that side effects, possibly in the form of algebraic effect handlers, may be the next big thing in functional languages. Newer languages such as Flix and Koka are investigating how to implement them efficiently and make the ergonomics tolerable for everyday programming. Why, then, did we not take this approach in Futhark?
The reason is while some people claim that programs are mainly for computers to run, others that they are for people to read, but I think that programs are for computers to re-write. Most of the Futhark compiler consists of passes that rewrite Futhark programs, and most of these passes are based on local rewrites where expressions are rewritten based on local information. A lot of Futhark’s optimisation power comes from the fact that the IR is very easy to analyse. In particular, the following principles apply to the IR design:
All data dependencies are explicit. An expression depends only on those variables it explicitly references.
Expression ordering is unimportant as long as data dependencies are respected.
Duplicating expressions is harmless.
The result of every expression is bound to a variable, and if that variable is not used, then the expression can be removed.
We do violate these principles slightly when it comes to uniqueness types, and unsurprisingly, this has also been the part of the IR that has caused the most difficulty when writing optimisations. But even they mostly impose constraints on expression ordering and duplication, and do not violate the principle of explicit data dependencies (and the duplication constraint can be solved with copying).
Adding accumulation effects would violate principle 3 above - a statement ys[i] += v does not produce a value that is explicitly referenced by some subsequent
computation, but is instead evaluated solely for its side effects. This means
that every transformation that moves or duplicates code would need to consider
the implications for effects. Compilers for imperative languages have this
problem all the time, but we implementers of functional languages are quite
spoiled, and I’d rather not burden every part of the Futhark compiler by adding
an effect system. However, I still want the nice operational advantages of being
able to poke directly at an array element. What is to be done?
Accumulators
Our solution to the conundrum is either a hack or an elegant extension, based on your perspective. While it has certain ad-hoc elements to it, it preserves the principles of the IR, and is semantically clean.
Specifically, our solution was to extend the Futhark IR with a new fundamental
type, called an accumulator, and some additional builtin operations. An
accumulator type acc(t) essentially denotes a write-only view of an array of
type t. For example, the type acc([n]f64) represent a write-only view of an
array of type [n]f64. We then add a builtin operation update_acc for
updating an accumulator at a given index. This operation is polymorphic in the
array shape and so cannot be given a type using Futhark function type notation,
but we can give a specialisation for [n]f64:
update_acc : acc([n]f64) -> i64 -> f64 -> acc([n]f64)
That is, you pass the accumulator, an index, a value, and get back a new accumulator. As a minor detail, we ignore out-of-bounds writes - this turns out to be convenient in some cases.
Accumulators are constructed using a function with_acc that you pass an array
and a function. The function is then called with the accumulator, and must
return the updated accumulator, at which point it is converted back into an
array:
with_acc : [n]f64
-> (acc([n]f64) -> acc([n]f64))
-> [n]f64
Semantically, this is a trivial construct. We can model it like this in Haskell:
data Acc t = Acc [t]
with_acc :: [t] -> (Acc t -> Acc t) -> [t]
with_acc xs f = let Acc xs' = f (Acc xs) in xs'
update_acc :: Acc t -> Int -> t -> Acc t
update_acc (Acc xs) i x =
if i >= 0 && i < length xs
then Acc (take i xs ++ [x] ++ drop (i + 1) xs)
else Acc xsIn this model, an accumulator is just a list, and updating it changes the corresponding element of the list.
Of course, the goal is that in Futhark, update_acc will directly update the
array underlying the accumulator, with constant cost, which is not the case for
the Haskell model. Some constraints are needed for this to be safe.
First, while an array is being used as underlying an accumulator, it must not be
possible to access the original array, as this would allow observation of
intermediate states. That is, the array passed to with_acc must become
inaccessible. Fortunately, this is actually expressible using Futhark’s
uniqueness types - we simply consume the
array passed to with_acc.
Second, an accumulator must not be used multiple times. This can also be ensured
by making update_acc consume the accumulator.
Finally, the accumulator returned by update_acc must be used for something.
This is actually ensured by with_acc, as the function passed must return an
accumulator, and the only way to obtain an accumulator is to return the one
passed to the function, or one returned by update_acc.
Type system aficionados may recognise that these are basically the properties of linear types. While Futhark does not have a true linear type system, the uniqueness type system can be used to simulate linearity constraints if we are careful. Essentially, we are encoding effects using linear values.
We’ve still not been sufficiently careful however, as it is still possible to
mess up. Specifically, if we nest with_accs, then there is nothing
preventing us from returning the same accumulator multiple times, or returning
the wrong one:
with_acc xs (\xs_acc ->
with_acc ys (\ys_acc ->
xs_acc)
ys_acc)
This is pretty bad, as it would allow us to observe an array while an
accumulator exists for it. Our solution is to refine the accumulator type by
associating each with a unique certificate that acts a bit like a region
tag and uniquely
identifies the with_acc that gave rise to it. Specifically, an accumulator now
has type acc(c,t), where c is the name of a variable of type unit bound by
with_acc, which now has the following type:
with_acc : [n]f64
-> ((c: unit) -> acc(c, [n]f64) -> acc(c, [n]f64))
-> [n]f64
This means that the following will now be a type error:
with_acc xs (\xs_c xs_acc ->
with_acc ys (\ys_c ys_acc ->
xs_acc)
ys_acc)The inner with_acc expects the function to return something of type
acc(ys_c,[n]f64), but it actually returns something of type
acc(xs_c,[n]f64). Thus we are prevented from accidentally doing the wrong
thing.
The treatment above assumes that accumulation is about overwriting a value in
an array, but our motivation is, well, accumulation, with an operator such as
addition. We then further extend with_acc to accept a combining operator
that is used to update the array element with the new value:
with_acc : (f64 -> f64)
-> [n]f64
-> ((c: unit) -> acc(c, [n]f64) -> acc(c, [n]f64))
-> [n]f64For AD, it will always be some appropriate addition operator, but in general it can be any associative and commutative operator.
I rather like our design for accumulators: it is semantically simple, data dependencies are explicit, and it piggybacks on the existing uniqueness/consumption system that the compiler already understands and handles correctly. There is one new edge case to consider, in that duplicating an accumulation cannot be fixed by copying, as accumulators cannot be copied, but this has turned out not to be a big issue in practice, as the compiler generally avoids duplicating code. I should clarify that I am not averse to rewriting big swathes of the compiler when needed, but adding actual effects to Futhark would impose a tax on all future compiler passes as well. This is a cost I would prefer to not pay if we can avoid it.
The implementation is also straightforward. Code generation for a with_acc
does basically nothing, except some book-keeping to relate an accumulator with
an array. When generating code for update_acc, the compiler looks up which
array the accumulator corresponds to, and simply generates the necessary memory
write.
There is one design aspect left to discuss, which is the reason we went to all this trouble in the first place: how accumulators interact with parallelism.
Parallel accumulators
The crucial aspect of accumulators that makes them safe is that they are used in
a linear fashion. That, however, means that we cannot simply use them from
within a parallel map. Consider:
with_acc (+) xs (\xs_c xs_acc ->
map (\i -> update_acc xs_acc i 1)
is)Two problems arise:
We are consuming
xs_accmultiple times because of the enclosingmap, which is not allowed.We are returning an array of accumulators, rather than a single one, which is not what
with_accexpects. We could index the array to pick out an arbitrary accumulator (they all reference the same underlying array after all), but this breaks linearity.
We need a way to split a single accumulator into multiple accumulators, and to
join them again. Our original approach was to just add such constructs to the
IR, but we had a hard time figuring out the properties that governed safe use:
in particular, indexing arrays of accumulators is basically never the right
thing to do, unless you index all of them, and then later join them. After a
while we realised that the only use we ever made of these split/join operations
was to split just before a map, and join immediately after. This made us
decide to simply build split/join of accumulators directly into map itself,
which in the IR is a language construct, not a function, and thus can have
ad-hoc rules. In a sense, we decided that an accumulator can implicitly be
treated as an array of accumulators, and that constructing an array of
accumulators (as happens when map returns them) implicitly joins them into a
single accumulator. That is, any construction of a type [m]acc(c,[n]f64)
immediately becomes acc(c,[n]f64). We then write the example above simply as:
with_acc (+) xs (\xs_c xs_acc ->
map2 (\xs_acc' i -> update_acc xs_acc' i 1)
xs_acc
is)Conceptually, each iteration of the map2 is given its own logical accumulator,
which is then implicitly joined at the end.
This seems a bit ad-hoc, and whenever I end up doing ad-hoc things for operational reasons, I want to make sure that the underlying semantics remain clean. And indeed, we can still model this notion of accumulators in a semantically clean way in Haskell. We have to change our representation from an accumulator being a list of values to instead being a list of the desired updates, that is, a list of index/value pairs:
data Acc t = Acc [(Int, t)]
with_acc :: (t -> t -> t) -> [t] -> (Acc t -> Acc t) -> [t]
with_acc op xs f =
let Acc ixs = f (Acc []) in foldl update xs ixs
where
update xs (i, x) =
if i >= 0 && i < length xs
then take i xs ++ [op (xs !! i) x] ++ drop (i + 1) xs
else xs
map_acc :: (Acc t -> v -> Acc t) -> Acc t -> [v] -> Acc t
map_acc f acc vs =
let accs = map (f (Acc [])) vs
in Acc $ unAcc acc ++ concat (map unAcc accs)
where
unAcc (Acc ixs) = ixs
update_acc :: Acc t -> Int -> t -> Acc t
update_acc (Acc ixs) i x =
Acc ((i, x) : ixs)Now update_acc merely adds an index/value pair to the list stored in the
Acc, and not until with_acc finishes do we manifest the updates to the
target list. Again, this is a semantic model: it’s not the kind of code we
expect the compiler to generate, and the operational properties are all wrong.
But it tells us that the meaning of a Futhark IR program that makes use of
accumulators is not particularly hard to understand. The model of map_acc in
Haskell is particularly simple, as it just concatenates the updates produced by
the function.
Operationally, maping an accumulator doesn’t do anything. It’s the same array
being referenced by all iterations, after all. We just have to be careful that
the updates are properly synchronised, which can be done in various ways
depending on the operator - for the case of AD and when using a parallel
backend, it will often be some kind of atomic addition.
Accumulators in the source language
Accumulators have been part of the Futhark IR since we made the initial release of the AD transformation, but they are not part of the source language. This is despite accumulators potentially being useful for things beyond expressing the output of reverse-mode AD. The reason we have not exposed them is not because we think the compiler should keep all the fun to itself, but because the constraints we have imposed to guarantee safe use are not expressible in the source language.
As a reminder, safety in the core language depends on certificates and
linearity, and flexibilty depends on special-casing map.
To start with, the source language is powerful enough to encode linearity. In a previous blog post I demonstrated how to exploit this to avoid bugs when managing random number generator states.
The problem is that the Futhark source language has very few built-in
constructs - everything is exposed through higher order functions, and higher
order functions all follow the same rules. For accumulators, the main
problematic rule is that when passing a function f to some higher order
function (such as map), then f may not consume any free variables - this is
because the type checker cannot reason about how many times f is applied. For
example, it means that nesting with_acc would be heavily restricted - the
outer accumulator cannot be updated inside the innermost with_acc.
Further, since map is just like any other function to the type checker, we
cannot allow implicit splitting/joining as in the IR (where map is a language
construct), and explicit splitting/joining of accumulators has unclear semantics
to me.
In truth, we have exposed accumulators in a limited and undocumented form, purely for the purpose of writing tests. Please do not use them in real programs; it would make me sad. Anyway, the interface is this:
type~ acc 't
val scatter_stream [k] 'a 'b :
(dest: *[k]a)
-> (f: *acc ([k]a) -> b -> acc ([k]a))
-> (bs: []b)
-> *[k]a
val reduce_by_index_stream [k] 'a 'b :
(dest: *[k]a)
-> (op: a -> a -> a)
-> (ne: a)
-> (f: *acc ([k]a) -> b -> acc ([k]a))
-> (bs: []b)
-> *[k]a
val write [n] 't :
(acc: *acc ([n]t))
-> (i: i64)
-> (v: t)
-> *acc ([n]t)The nomenclature is a little different: we say write instead of update_acc.
Further, we do not expose with_acc directly, but rather provide two constructs
that combine with_acc with a map over some input array. Also, the acc type
itself does not have the certificate we use in the IR, because the inability to
nest accumulators makes it unnecessary.
I don’t think the inability to program with accumulators in the source language
is a big problem, but I have wondered a few times what it would take to make
it possible. Currently my hunch is that if we add some notion of “linear
function” that a higher order function can use to indicate that a provided
function is only applied once, or perhaps encode consumption using an effect
system, then we may be able to loosen the restrictions on consuming free
variables, which may make it possible to expose with_acc in a more general
way. I am reluctant to complicate the Futhark type system for the usual
reasons, but I
am quite interested in the research challenge of compiling a language with
effects to one that does not have effects, but which instead models effects
using linear variables - like the Futhark IR. But perhaps this is best
investigated by designing a new language that just happens to target the Futhark
IR.