AUTOMAP: How to do NumPy-style broadcasting in Futhark (but better)
Troels (the guy who usually writes the blog posts) asked me (Robert) to write about some recent work in Futhark that we call AUTOMAP, which allows you to do something akin to NumPy-style broadcasting in Futhark. I’m a PhD student here at DIKU and I’ve worked on Futhark one way or another for most of my PhD. (P.S. I’ll be graduating in December 2024 and I’m actively looking for potential postdoc positions in all things type systems/functional programming/compilers—please get in touch if you think I might be a match!)
AUTOMAP isn’t available in Futhark’s master
branch quite yet (we’ve
implemented a prototype)
but we expect to merge it in the nearish future. We’ve also written a paper on
AUTOMAP (currently in review); I’m going to give a talk based on it at
ARRAY24 on June 25th.
When Troels first suggested adding something like broadcasting to Futhark, we thought it’d be a fun/simple feature and a good break from the heavy duty work on automatic differentiation we had just published. In short, we were quite wrong about that and there was a surprising amount of complexity involved, but we ended up with a really cool (and dare I say elegant) solution.
The motivation for the work was a desire to make mathematical code involving
higher-dimensional arguments look closer to something you’d write by hand,
without the clutter introduced by a bunch of “administrative” maps that are
needed to lift scalar operations. For example, if you want to add the two
vectors (represented by arrays) [1,2,3]
and [4,5,6]
together, you have to
lift +
to work over the arrays:
1,2,3] [4,5,6] map2 (+) [
But if you were writing some math notes by hand, you’d probably just write:
1,2,3] + [4,5,6] [
with the tacit understanding that what you really mean is the piecewise addition
of the two vectors.1 In fact, piecewise addition is really the only
reasonable thing to do here if the idea is to “make +
work for these
higher-dimensional arguments”—shouldn’t we be able to write [1,2,3] + [4,5,6]
and have the compiler figure out that we really meant map2 (+) [1,2,3] [4,5,6]
? This isn’t only a boon for the lazy programmer; omitting the “noise”
of the map also makes things clearer and lets us use +
in its proper infix
location (sorry, Lisp programmers).
A few examples
As a first step, let’s try to pin down what it is that we want the compiler to do by looking at a few more examples. How about subtracting one matrix from another:
1,2,3], [[9,8,7],
[[4,5,6], - [6,5,4],
[7,8,9]] [3,2,1]] [
Now we have to lift -
to work over two-dimensional arrays instead of
one-dimensional, so we apply map2
twice:
1,2,3], [[9,8,7],
[[4,5,6], [6,5,4],
map2 (map2 (-)) [7,8,9]] [3,2,1]] [
For simplicity, we’re also going to assume map
is an n-argument map that works
for any number of arguments. So map (+1) [1,2,3] = [2,3,4]
, map (+) [1,2] [3,4] = [4,6]
, and so on. Hence the above can be expressed as just
1,2,3], [[9,8,7],
[[4,5,6], [6,5,4],
map (map (-)) [7,8,9]] [3,2,1]] [
What happens if we add together two things of different ranks? (“Rank” is just the number of dimensions an array has; for example a matrix has rank 2 and an integer has rank 0.) Let’s try a matrix and a vector:
1,2,3],
[[4,5,6], + [10,20,30]
[7,8,9]] [
Since a matrix (represented by a two-dimensional array) is involved, we know we we’ll have to lift twice:
1,2,3],
[[4,5,6], [10,20,30]
map (map (+)) [7,8,9]] [
But this isn’t enough because to add these two piecewise we need to be able to pair up each element of the matrix with an element of the vector. We’ll have to boost the dimensionality of the vector by replicating it to match the shape of the matrix:
1,2,3],
[[4,5,6], (replicate 3 [10,20,30])
map (map (+)) [7,8,9]] [
We actually don’t care about the specific number of times that we have to
replicate the vector to make it match the matrix because because the compiler
can figure it
out
later. So, to keep things simpler, we’ll use a “size-free” version of
replicate
that we call rep : a -> []a
. It just tacks on an extra dimension
by replicating its argument an unknown number of times (that will be determined
later). Think of rep
as replicate n
where the correct n
is chosen
automatically. So, using rep
instead of replicate
, the above becomes
1,2,3],
[[4,5,6], (rep [10,20,30])
map (map (+)) [7,8,9]] [
In the above examples, we’re applying scalar operators to higher-ranked
arguments and then massaging the application into something that makes sense/is
rank-correct by inserting map
s and rep
s. This works for more than just
scalar operators: any function application involving a function whose parameters
have a rank mismatch with the arguments can be transformed into a rank-correct
application in this way.
As a final more general example, let’s look at a function f : []int -> [][]int -> int -> int
applied to argument an xss : [][]int
:
f xss xss xss
The above application can be made rank-correct by inserting the following
map
s and rep
s
map ((map f xss) (rep xss)) xss
(All functions in Futhark are curried;
f xss xss xss
is really ((f xss) xss) xss
and is hence actually three
separate applications from the perspective of inserting map
s and rep
s.) This
approach to making applications rank-correct is completely agnostic to
infix/prefix application or any specific properties a function may have—an
application can always be made rank-correct by insertion of map
s and rep
s.
If the compiler can do said insertion automatically, then we can always apply
functions to arguments of any rank. Programming languages that support function
application like this are called rank polymorphic.2 Our task, then, is to
endow Futhark with rank polymorphism by giving it the capability to
automatically infer map
s and rep
s to make function applications
rank-correct.
(Static) rank polymorphism is hard
Most rank polymorphic languages are dynamic: they don’t do the requisite lifting and replicating until runtime, when the arguments to a function are fully known. NumPy and APL are probably the two most popular examples of dynamic rank polymorphic languages.
It’s generally easier to do rank polymorphism in dynamic programming systems like NumPy because you don’t have to figure out how exactly to make a function application rank-correct until runtime, when everything (function and arguments) is fully known. Of course, there are drawbacks including runtime overhead and a lack of transparency to the user (the only way to inspect how a function is lifted/replicated is to run the program).
In a statically-typed language like Futhark, things are harder. How to make a
function application rank-correct (how to insert map
s and rep
s) has to be
decided statically during type checking, without any runtime information
whatsoever.3 This is made especially hard by the fact that Futhark supports
parametric polymorphism (that is, you can write general functions with type
variables like id : a -> a
or map : (a -> b) -> []a -> []b
)—a type
variable can stand for any type and hence have any rank, which complicates
reasoning about how to insert map
s and rep
s.
Ambiguity makes things harder still
There are more challenges on the horizon: there
are often many different ways to insert map
s and rep
s to make
a function application rank-correct. A simple example is
sum (length xss)
where sum : []int -> int
, length : []a -> int
, and xss : [][]int
. One way
to make this application rank correct is by not modifying the length xss
application
at all and instead rep
it when it’s passed to sum
as an argument:
sum (rep (length xss))
But if you’re like me and lone rep
s give you the ick, you’ll probably prefer
sum (map length xss)
where the inner application with length
is lifted so that the outer
application with sum
doesn’t have to rep
its argument. Or we might just go
totally crazy and come up with this rank-correct concoction:
map sum (map (map length) (rep xss))
In fact, there are actually an infinite number of ways that we can insert
map
s and rep
s to make the application rank-correct. What should we choose
and how should we choose it? We’re going to need a strategy!
The Strategy
Our first step in coming up with a system in which the compiler can make a
reasonable choice on how to map
/rep
an application is laying some ground
rules. There are two:
Rule 1: An application can be map
ped or rep
ped (or neither) but never both.
This means, given f x
, we can write (as long as it’s rank-correct to do so)
map f x
, map (map f) x
, or f (rep (rep x))
, but writing map f (rep x)
or
any variation that includes a non-zero number of map
s and rep
s isn’t
allowed.
This restriction exists because it’s never necessary to both map
and rep
in
a single application since there are only two possible ways that there can be a
rank mismatch: either the argument’s rank is higher than what the function
expects (and hence a map
(or multiple) is required) or the argument’s rank is
less than what the function expects (and one or more rep
s is required).
Note that the example
1,2,3],
[[4,5,6], + [10,20,30]
[7,8,9]] [
which was elaborated into
1,2,3],
[[4,5,6], (rep [10,20,30])
map (map (+)) [7,8,9]] [
doesn’t violate Rule 1 because functions are curried—the above is actually two
separate applications, one with map
s and the other with a rep
.
Rule 2: Minimize the number of inserted map
s and rep
s.
This rule says that we want the compiler to insert the fewest number of map
s
and rep
s to make the program rank-correct. We found that this generally aligns
with the programmer’s intent and that it also provides for a simple mental model
for the programmer. For example, if ysss : [][][]int
is a three-dimensional array,
no map
s or rep
s will be inserted at all for the program length ysss
even
though map length ysss
and map (map length) ysss
are both legal elaborations.
This also helps us address ambiguity: the compiler no longer has to choose from an infinite number of solutions and instead only the minimal ones. Unfortunately, sometimes there are multiple minimal solutions and hence Rule 2 doesn’t entirely free us from the possibility of ambiguity.
Even with The Strategy, ambiguities may exist
Sometimes Rule 1 and Rule 2 aren’t enough. Returning to our previous example involving
sum
and length
, observe that
sum (rep (length xss))
and
sum (map length xss)
are both minimal because they have the same number of inserted operations (one each). When there are multiple minimal elaborations, we say that the program is ambiguous. In this case, we have to ask the programmer for some help by signaling an error. Because we’re doing everything statically, we know the two possibilities and the compiler can give the programmer a choice:
sum (length xss) is ambiguous. Possible elaborations:
1) sum (rep (length xss))
(2) sum (map length xss) (
We can conceive of an interactive system where the user just types 1
or 2
to
make a choice and disambiguate (although having the compiler modify the
programmer’s source code in this way probably isn’t a particularly good design).
A better (and simpler) approach is a system where the user has to manually
disambiguate. In practice, it’s actually pretty nice too! Why? Because rep
and
map
are just normal Futhark functions and can always be inserted by the user
to disambiguate. If the user wants option 2, they can just plop a map
in the
source code in front of length
to disambiguate.
This is actually a pretty cool feature: rank polymorphism in Futhark is
basically a kind of syntactic sugar that lets you leave map
s and rep
s
implicit. You can always disambiguate when necessary by just inserting map
s or
rep
s yourself (usually just one) to make the implicit map
or rep
explicit. It also means you can ask the compiler to elaborate all implicit
map
s and rep
s so you can always see what’s going on. No mystery here!
All applications must be considered simultaneously
Another thing to notice in the
sum (map length xss)
elaboration of sum (length xss)
is that we had to consider the outer
application with sum
to know how to map
the inner length
application. In
general, lifting one application can affect others that depend on it because
map
ping changes the rank of the application itself. The implication of this is
quite significant: inserting map
s or rep
s to make an application
rank-correct requires considering the entire top-level definition at hand
(rather than just the local application).
This doesn’t mean it’s impossible to do rank polymorphism by only considering
local applications one-by-one. If we had just considered the inner length xss
application in isolation, Rule 2 would yield length xss
without any
map
. Proceeding to the outer application then yields the other possible
minimal elaboration:
sum (rep (length xss))
But this was just good luck; only considering applications one-by-one will not
lead to a minimal solution in general. It will also often lead to “ugly”
elaborations that the programmer probably didn’t intend. Minimality in the
number of map
s and rep
s across a function’s body as a whole is a useful
target because we’ve found it usually corresponds with what the programmer
intends; non-minimality is often the product of needing extra map
s and
rep
s to overcome the extra dimensions introduced by map
s and rep
s in
earlier applications. In practice, programmers don’t write code like this and
hence the subsequent elaboration by the compiler probably won’t correspond to
their intent.
Now that we have our objective and strategy in hand, there’s only one thing left to do—find the rank-correct elaborations while heeding Rule 1 and Rule 2!
Finding minimal rank-correct elaborations
The first step involves traversing the program and, for every application,
generating constraints that encapsulate the requirements to obtain a
rank-correct program. We (or, rather, the compiler) can then solve those
constraints to figure out how to insert map
s or rep
s to make things
rank-correct.
Consider the application of some function f : a -> b
to an argument x : c
:
f x
In a normal application, we require that the type of function’s parameter and the actual argument are equal, so we’d have the constraint
a = c
In our context, we’re only interested in rank equality (actual equality will be
handled by the type checker later, after all map
s and rep
s have been
inserted and the applications are rank-correct), so we relax the constraint to
just enforcing rank equality, which we write like so:
|a| = |c|
where |a|
means “the rank of a
”. For example, |[][]int| = 2
and |int| = 0
.
Now we want to relax things further and allow for rank differences. If c
has
fewer dimensions than a
(i.e. |c| < |a|
) we’ll have to rep
it. We don’t
know the rank difference so we don’t know how much to replicate it by yet (how
many rep
s we need), so we’ll represent this unknown quantity with the rank
variable R
:
|a| = R + |c|
For example, if f : [][]int -> int
and x : []int
then we’d have
|[][]int| = R + |[]int|
which is equal to
2 = R + 1
We can clearly see that this constraint holds when R = 1
; this means we need
one rep
to make the application rank-correct and f (rep x)
would be a valid
elaboration.
The other way a
and c
might have a rank mismatch is that c
has more
dimensions than a
(i.e. |c| > |a|
), in which case we’ll have to map
. We
again don’t know how many map
s we’ll need, so we’ll represent it with the rank
variable M
:
M + |a| = |c|
Now, for example, if f : int -> int
and x : [][]int
then we’d have
M + |int| = |[][] int|
which is equivalent to
2 M =
This means we have to add two map
s to make the application rank-correct: map (map f) x
.
Of course—since we don’t know anything about the rank difference in
general—we don’t know if we need to map
or rep
so we need to include both
R
and M
in our rank constraint:
M + |a| = R + |c|
We still need to enforce our first rule—that we either map
or rep
but
not both. To do so, we just require that either M = 0
or R = 0
via an
additional constraint. Hence, every function application f x
generates two
constraints:
M + |a| = R + |c|
and
0 or R = 0 M =
During type checking, we accumulate these constraints into a big list. For
example, when checking sum (length xss)
, we first check length xss
(recalling that length : []a -> int
and xss : [][]int
). This yields the
constraints
1 + |a| = R₁ + 2
M₁ + 0 or R₁ = 0 M₁ =
(Where we’ve added the subscript 1
to the rank variables, since we’ll have
two sets of rank variables from the two applications and need to distinguish
them.) Notice that this time the first constraint includes the rank of a type
variable: |a|
. This is necessary because the ranks of type variables are not
immediately known (they’re type variables and can be anything). You should think
of |a|
as just another rank variable that the system will have to solve
for. To solve the above constraints, we have to find a mapping that assigns a
rank to M₁
, R₁
and |a|
.
Now we have to check the outer application involving sum
.
We have to be a bit careful when constructing the requisite rank equality
constraint between the parameters of sum
and its argument length xss
. If
the rank variable M₁
above is assigned a non-zero value, then that means that
the application length xss
has implicit maps—namely M₁
of them. This
means that (recalling that length : []a -> int
) the type of the term length xss
will not be int
but instead be an M₁
-dimensional int
array because
each implicit map
will boost the rank of the application itself by 1. So, we
have to tack on extra dimensions to the type of the application corresponding to
each implicit map
: length xss : []ᴹ¹ int
, where []ᴹ¹
is just a
sequence of []
of length M₁
(and where we understand length xss
to have
M₁
implicit maps). If we end up inferring that no map
s are to be inserted,
then M₁ = 0
and we’ll have length xss : []⁰ int
which is equivalent to
length xss : int
, as expected.
As an example, if xss : [][]int
then the explicitly elaborated application
map length xss : []int
is rank-correct and hence we assign the implicit
application length xss
the type []int
.
So, when generating constraints for sum (length xss)
we’re really generating
constraints for sum (length xss : []ᴹ¹ int)
because of the (potential)
implicit maps in length xss
. Recalling that sum : []int -> int
), we have
M₂ + |[]| = R₂ + |[]ᴹ¹ int|0 or R₂ = 0 M₂ =
which is equivalent to
1 = R₂ + M₁
M₂ + 0 or R₂ = 0 M₂ =
Notice that the rank equality constraint depends on M₁
(the map
rank
variable generated in the constraints for length xss
). This is an explicit
witness for the fact that how you insert map
s at one application affects
subsequent applications that depend on it.
Combining the constraints, the constraint set that expresses the rank
requirements for sum (length xss)
is
1 + |a| = R₁ + 2
M₁ + 0 or R₁ = 0
M₁ = 1 = R₂ + M₁
M₂ + 0 or R₂ = 0 M₂ =
What’s left is to find a minimal assignment for M₁
, R₁
, M₂
, R₂
, and
|a|
(and more generally for any constraint set generated in this way) such
that the above constraints hold .
Solving the constraints
Our first order of business is figuring out how to get a minimal solution (that
is, a way to enforce Rule 2—recall that Rule 1 is already enforced by the
constraints themselves). Since this just requires that the number of map
s and
rep
s (which are represented by the M
and R
rank variables, respectively)
is minimal, this is the same thing as minimizing the sum of the rank
variables. For the example above, that corresponds to the following minimization
problem:
minimize
M₁ + R₁ + M₂ + R₂
subject to1 + |a| = R₁ + 2
M₁ + 0 or R₁ = 0
M₁ = 1 = R₂ + M₁
M₂ + 0 or R₂ = 0 M₂ =
If you’re familiar with integer linear programs
(ILPs), you may have noticed
that this looks a lot like one except for the or-constraints enforcing
Rule 1, which aren’t linear. Fortunately, we can linearize those constraints
using some linear programming trickery: for each pair of rank variables Mᵢ
and Rᵢ
, we introduce two binary variables b_Mᵢ
and b_Rᵢ
that indicate
whether the application is to be map
ped or rep
ped. The idea is to enforce
that Mᵢ > 0
only if b_Mᵢ = 1
(and analogously for Rᵢ
and
b_Rᵢ
). Then, by requiring that b_Mᵢ + b_Rᵢ ≤ 1
, we then enforce that
either Mᵢ = 0
or Rᵢ = 0
. To establish the requisite relationship between
Mᵢ
and b_Mᵢ
, we add the constraint
Mᵢ ≤ b_Mᵢ * BIG_NUMBER
where BIG_NUMBER
is just a big number. When b_Mᵢ = 0
, we have Mᵢ ≤ 0
forcing it to be 0. When b_Mᵢ = 1
, we have Mᵢ ≤ BIG_NUMBER
which
effectively lets the solver assign Mᵢ
to be any rank-satisfying value (as
long as BIG_NUMBER
is bigger than any rank we expect to see in a program—in
practice ranks are small so it isn’t a big deal). We do something analogous for
Rᵢ
and b_Rᵢ
to obtain the linearized minimization problem
minimize
M₁ + R₁ + M₂ + R₂
subject to1 + |a| = R₁ + 2
M₁ + 1
b_M₁ + b_R₁ ≤
M₁ ≤ b_M₁ * BIG_NUMBER
R₁ ≤ b_R₁ * BIG_NUMBER1 = R₂ + M₁
M₂ + 1
b_M₂ + b_R₂ ≤
M₂ ≤ b_M₂ * BIG_NUMBER
R₂ ≤ b_R₂ * BIG_NUMBER
Now we have ourselves a real ILP! To solve it, we can just feed it into any existing solver (there are many, in our prototype we used GLPK) or implement one of the known algorithms for ILP solving. Technically, ILP is in NP (all known solvers have exponential runtime) but in practice solvers are very fast and the ILPs we generate are small, so even if you’re solving a bunch of these ILPs when type checking a large program it still makes for a very usable system. (Besides, normal type inference has exponential runtime anyway.)
Putting it all together
We now have all the ingredients we need to bestow Futhark with the power of rank polymorphism, a.k.a AUTOMAP. Here’s how the AUTOMAP system works from start-to-finish for each top-level definition in a program:
For each application generate a rank equality constraint and a Rule 1 constraint and add it to the constraint set.
Transform the constraint set into an ILP and solve it. If the ILP has no solutions or multiple solutions, fail and report an error to the user.
The solution to the ILP tells the compiler how to insert the minimal number of
map
s andrep
s to make the program rank-correct. For example, ifMᵢ = 3
andRᵢ = 0
for an applicationf x
, then transform the application by inserting threemap
s and norep
s:f x ⟶ map (map (map f)) x
.Type check the transformed program with all implicit
map
s andrep
s made explicit using Futhark’s normal type checker and continue with compilation as usual.
The extra nice thing here is that we can do all this without affecting how the
rest of the compiler works. This is a direct result of the fact that Futhark
treats rank polymorphism as syntactic sugar and there’s no “magic” going
on—all implicit constructs (map
and rep
) are just standard functions
already available to the programmer. This also means that the programmer is free
to use the AUTOMAP to whatever extent that they wish—from omitting all map
s
and rep
s (up to ambiguity), to only omitting some, to having a program with all map
s
and rep
s explicit.
That’s it! P.S. In case you’re wondering—we call the system AUTOMAP instead of
AUTOMAPREP because implicit map
s are way more common than implicit rep
s (and
AUTOMAPREP isn’t as catchy).
Footnotes
Of course you could be rigorous here and also add something along the lines of “define
(x + y)_i = x_i + y_i
” (wherex_i
is thei
th component ofx
) to your notes so that the operation is well-defined. With this approach there are really two definitions of+
—one for scalars and one for vectors and we’ve just grouped them under the same name. From a programming languages perspective, this corresponds to ad-hoc polymorphism (a.k.a function/operator overloading) wherein you group a bunch of functions (which expect different inputs) under the same name.↩︎In most languages with rank polymorphism, this isn’t a property of function applications but rather of functions themselves. Typed Remora is a nice example because it demonstrates this explicitly at the type-level: just like how we abstract over type parameters in parametric polymorphism (e.g.,
id : forall a. a -> a
), Remora abstracts over shape parameters as well (“shape polymorphic” might be a better name than “rank polymorphic”). For example,sum : forall S. S int -> int
(whereS
is a shape variable) might collapse any higher-dimensional array (say a matrix or a vector or a 36-dimensional tensor) into an integer. Unfortunately, this generality comes at a cost and makes type inference very, very challenging.↩︎You may be wondering why you couldn’t just type check function applications modulo rank mismatches and delay deciding how to
map
andrep
until runtime. There are multiple reasons that is a bad idea aside from the usual dynamic rank polymorphism drawbacks: the ensuing lack of precise static type information would also weaken type checking/inference, make many optimizations intractable, and basically muck-up the existent compiler from end to end.↩︎