Fork me on GitHub

AUTOMAP: How to do NumPy-style broadcasting in Futhark (but better)

Posted on June 17, 2024 by Robert Schenck

Troels (the guy who usually writes the blog posts) asked me (Robert) to write about some recent work in Futhark that we call AUTOMAP, which allows you to do something akin to NumPy-style broadcasting in Futhark. I’m a PhD student here at DIKU and I’ve worked on Futhark one way or another for most of my PhD. (P.S. I’ll be graduating in December 2024 and I’m actively looking for potential postdoc positions in all things type systems/functional programming/compilers—please get in touch if you think I might be a match!)

AUTOMAP isn’t available in Futhark’s master branch quite yet (we’ve implemented a prototype) but we expect to merge it in the nearish future. We’ve also written a paper on AUTOMAP (currently in review); I’m going to give a talk based on it at ARRAY24 on June 25th.

When Troels first suggested adding something like broadcasting to Futhark, we thought it’d be a fun/simple feature and a good break from the heavy duty work on automatic differentiation we had just published. In short, we were quite wrong about that and there was a surprising amount of complexity involved, but we ended up with a really cool (and dare I say elegant) solution.

The motivation for the work was a desire to make mathematical code involving higher-dimensional arguments look closer to something you’d write by hand, without the clutter introduced by a bunch of “administrative” maps that are needed to lift scalar operations. For example, if you want to add the two vectors (represented by arrays) [1,2,3] and [4,5,6] together, you have to lift + to work over the arrays:

map2 (+) [1,2,3] [4,5,6]

But if you were writing some math notes by hand, you’d probably just write:

[1,2,3] + [4,5,6]

with the tacit understanding that what you really mean is the piecewise addition of the two vectors.1 In fact, piecewise addition is really the only reasonable thing to do here if the idea is to “make + work for these higher-dimensional arguments”—shouldn’t we be able to write [1,2,3] + [4,5,6] and have the compiler figure out that we really meant map2 (+) [1,2,3] [4,5,6]? This isn’t only a boon for the lazy programmer; omitting the “noise” of the map also makes things clearer and lets us use + in its proper infix location (sorry, Lisp programmers).

A few examples

As a first step, let’s try to pin down what it is that we want the compiler to do by looking at a few more examples. How about subtracting one matrix from another:

[[1,2,3],   [[9,8,7],
 [4,5,6], -  [6,5,4],
 [7,8,9]]    [3,2,1]]

Now we have to lift - to work over two-dimensional arrays instead of one-dimensional, so we apply map2 twice:

                [[1,2,3], [[9,8,7],
map2 (map2 (-))  [4,5,6],  [6,5,4],
                 [7,8,9]]  [3,2,1]]

For simplicity, we’re also going to assume map is an n-argument map that works for any number of arguments. So map (+1) [1,2,3] = [2,3,4], map (+) [1,2] [3,4] = [4,6], and so on. Hence the above can be expressed as just

              [[1,2,3], [[9,8,7],
map (map (-))  [4,5,6],  [6,5,4],
               [7,8,9]]  [3,2,1]]

What happens if we add together two things of different ranks? (“Rank” is just the number of dimensions an array has; for example a matrix has rank 2 and an integer has rank 0.) Let’s try a matrix and a vector:

 [4,5,6],  + [10,20,30]

Since a matrix (represented by a two-dimensional array) is involved, we know we we’ll have to lift twice:

map (map (+))  [4,5,6],  [10,20,30]

But this isn’t enough because to add these two piecewise we need to be able to pair up each element of the matrix with an element of the vector. We’ll have to boost the dimensionality of the vector by replicating it to match the shape of the matrix:

map (map (+))  [4,5,6],  (replicate 3 [10,20,30])

We actually don’t care about the specific number of times that we have to replicate the vector to make it match the matrix because because the compiler can figure it out later. So, to keep things simpler, we’ll use a “size-free” version of replicate that we call rep : a -> []a. It just tacks on an extra dimension by replicating its argument an unknown number of times (that will be determined later). Think of rep as replicate n where the correct n is chosen automatically. So, using rep instead of replicate, the above becomes

map (map (+))  [4,5,6],  (rep [10,20,30])

In the above examples, we’re applying scalar operators to higher-ranked arguments and then massaging the application into something that makes sense/is rank-correct by inserting maps and reps. This works for more than just scalar operators: any function application involving a function whose parameters have a rank mismatch with the arguments can be transformed into a rank-correct application in this way.

As a final more general example, let’s look at a function f : []int -> [][]int -> int -> int applied to argument an xss : [][]int:

f xss xss xss

The above application can be made rank-correct by inserting the following maps and reps

map ((map f xss) (rep xss)) xss

(All functions in Futhark are curried; f xss xss xss is really ((f xss) xss) xss and is hence actually three separate applications from the perspective of inserting maps and reps.) This approach to making applications rank-correct is completely agnostic to infix/prefix application or any specific properties a function may have—an application can always be made rank-correct by insertion of maps and reps. If the compiler can do said insertion automatically, then we can always apply functions to arguments of any rank. Programming languages that support function application like this are called rank polymorphic.2 Our task, then, is to endow Futhark with rank polymorphism by giving it the capability to automatically infer maps and reps to make function applications rank-correct.

(Static) rank polymorphism is hard

Most rank polymorphic languages are dynamic: they don’t do the requisite lifting and replicating until runtime, when the arguments to a function are fully known. NumPy and APL are probably the two most popular examples of dynamic rank polymorphic languages.

It’s generally easier to do rank polymorphism in dynamic programming systems like NumPy because you don’t have to figure out how exactly to make a function application rank-correct until runtime, when everything (function and arguments) is fully known. Of course, there are drawbacks including runtime overhead and a lack of transparency to the user (the only way to inspect how a function is lifted/replicated is to run the program).

In a statically-typed language like Futhark, things are harder. How to make a function application rank-correct (how to insert maps and reps) has to be decided statically during type checking, without any runtime information whatsoever.3 This is made especially hard by the fact that Futhark supports parametric polymorphism (that is, you can write general functions with type variables like id : a -> a or map : (a -> b) -> []a -> []b)—a type variable can stand for any type and hence have any rank, which complicates reasoning about how to insert maps and reps.

Ambiguity makes things harder still

There are more challenges on the horizon: there are often many different ways to insert maps and reps to make a function application rank-correct. A simple example is

sum (length xss)

where sum : []int -> int, length : []a -> int, and xss : [][]int. One way to make this application rank correct is by not modifying the length xss application at all and instead rep it when it’s passed to sum as an argument:

sum (rep (length xss))

But if you’re like me and lone reps give you the ick, you’ll probably prefer

sum (map length xss)

where the inner application with length is lifted so that the outer application with sum doesn’t have to rep its argument. Or we might just go totally crazy and come up with this rank-correct concoction:

map sum (map (map length) (rep xss))

In fact, there are actually an infinite number of ways that we can insert maps and reps to make the application rank-correct. What should we choose and how should we choose it? We’re going to need a strategy!

The Strategy

Our first step in coming up with a system in which the compiler can make a reasonable choice on how to map/rep an application is laying some ground rules. There are two:

Rule 1: An application can be mapped or repped (or neither) but never both.

This means, given f x, we can write (as long as it’s rank-correct to do so) map f x, map (map f) x, or f (rep (rep x)), but writing map f (rep x) or any variation that includes a non-zero number of maps and reps isn’t allowed.

This restriction exists because it’s never necessary to both map and rep in a single application since there are only two possible ways that there can be a rank mismatch: either the argument’s rank is higher than what the function expects (and hence a map (or multiple) is required) or the argument’s rank is less than what the function expects (and one or more reps is required).

Note that the example

 [4,5,6],  + [10,20,30]

which was elaborated into

map (map (+))  [4,5,6],  (rep [10,20,30])

doesn’t violate Rule 1 because functions are curried—the above is actually two separate applications, one with maps and the other with a rep.

Rule 2: Minimize the number of inserted maps and reps.

This rule says that we want the compiler to insert the fewest number of maps and reps to make the program rank-correct. We found that this generally aligns with the programmer’s intent and that it also provides for a simple mental model for the programmer. For example, if ysss : [][][]int is a three-dimensional array, no maps or reps will be inserted at all for the program length ysss even though map length ysss and map (map length) ysss are both legal elaborations.

This also helps us address ambiguity: the compiler no longer has to choose from an infinite number of solutions and instead only the minimal ones. Unfortunately, sometimes there are multiple minimal solutions and hence Rule 2 doesn’t entirely free us from the possibility of ambiguity.

Even with The Strategy, ambiguities may exist

Sometimes Rule 1 and Rule 2 aren’t enough. Returning to our previous example involving sum and length, observe that

sum (rep (length xss))


sum (map length xss)

are both minimal because they have the same number of inserted operations (one each). When there are multiple minimal elaborations, we say that the program is ambiguous. In this case, we have to ask the programmer for some help by signaling an error. Because we’re doing everything statically, we know the two possibilities and the compiler can give the programmer a choice:

sum (length xss) is ambiguous. Possible elaborations:

(1)  sum (rep (length xss))
(2)  sum (map length xss)

We can conceive of an interactive system where the user just types 1 or 2 to make a choice and disambiguate (although having the compiler modify the programmer’s source code in this way probably isn’t a particularly good design). A better (and simpler) approach is a system where the user has to manually disambiguate. In practice, it’s actually pretty nice too! Why? Because rep and map are just normal Futhark functions and can always be inserted by the user to disambiguate. If the user wants option 2, they can just plop a map in the source code in front of length to disambiguate.

This is actually a pretty cool feature: rank polymorphism in Futhark is basically a kind of syntactic sugar that lets you leave maps and reps implicit. You can always disambiguate when necessary by just inserting maps or reps yourself (usually just one) to make the implicit map or rep explicit. It also means you can ask the compiler to elaborate all implicit maps and reps so you can always see what’s going on. No mystery here!

All applications must be considered simultaneously

Another thing to notice in the

sum (map length xss)

elaboration of sum (length xss) is that we had to consider the outer application with sum to know how to map the inner length application. In general, lifting one application can affect others that depend on it because mapping changes the rank of the application itself. The implication of this is quite significant: inserting maps or reps to make an application rank-correct requires considering the entire top-level definition at hand (rather than just the local application).

This doesn’t mean it’s impossible to do rank polymorphism by only considering local applications one-by-one. If we had just considered the inner length xss application in isolation, Rule 2 would yield length xss without any map. Proceeding to the outer application then yields the other possible minimal elaboration:

sum (rep (length xss))

But this was just good luck; only considering applications one-by-one will not lead to a minimal solution in general. It will also often lead to “ugly” elaborations that the programmer probably didn’t intend. Minimality in the number of maps and reps across a function’s body as a whole is a useful target because we’ve found it usually corresponds with what the programmer intends; non-minimality is often the product of needing extra maps and reps to overcome the extra dimensions introduced by maps and reps in earlier applications. In practice, programmers don’t write code like this and hence the subsequent elaboration by the compiler probably won’t correspond to their intent.

Now that we have our objective and strategy in hand, there’s only one thing left to do—find the rank-correct elaborations while heeding Rule 1 and Rule 2!

Finding minimal rank-correct elaborations

The first step involves traversing the program and, for every application, generating constraints that encapsulate the requirements to obtain a rank-correct program. We (or, rather, the compiler) can then solve those constraints to figure out how to insert maps or reps to make things rank-correct.

Consider the application of some function f : a -> b to an argument x : c:

f x

In a normal application, we require that the type of function’s parameter and the actual argument are equal, so we’d have the constraint

a = c

In our context, we’re only interested in rank equality (actual equality will be handled by the type checker later, after all maps and reps have been inserted and the applications are rank-correct), so we relax the constraint to just enforcing rank equality, which we write like so:

|a| = |c|

where |a| means “the rank of a”. For example, |[][]int| = 2 and |int| = 0.

Now we want to relax things further and allow for rank differences. If c has fewer dimensions than a (i.e. |c| < |a|) we’ll have to rep it. We don’t know the rank difference so we don’t know how much to replicate it by yet (how many reps we need), so we’ll represent this unknown quantity with the rank variable R:

|a| = R + |c|

For example, if f : [][]int -> int and x : []int then we’d have

|[][]int| = R + |[]int|

which is equal to

2 = R + 1

We can clearly see that this constraint holds when R = 1; this means we need one rep to make the application rank-correct and f (rep x) would be a valid elaboration.

The other way a and c might have a rank mismatch is that c has more dimensions than a (i.e. |c| > |a|), in which case we’ll have to map. We again don’t know how many maps we’ll need, so we’ll represent it with the rank variable M:

M + |a| = |c|

Now, for example, if f : int -> int and x : [][]int then we’d have

M + |int| =  |[][] int|

which is equivalent to

M = 2

This means we have to add two maps to make the application rank-correct: map (map f) x.

Of course—since we don’t know anything about the rank difference in general—we don’t know if we need to map or rep so we need to include both R and M in our rank constraint:

M + |a| = R + |c|

We still need to enforce our first rule—that we either map or rep but not both. To do so, we just require that either M = 0 or R = 0 via an additional constraint. Hence, every function application f x generates two constraints:

M + |a| = R + |c|


M = 0 or R = 0

During type checking, we accumulate these constraints into a big list. For example, when checking sum (length xss), we first check length xss (recalling that length : []a -> int and xss : [][]int). This yields the constraints

M₁ + 1 + |a| = R₁ + 2
M₁ = 0 or R₁ = 0

(Where we’ve added the subscript 1 to the rank variables, since we’ll have two sets of rank variables from the two applications and need to distinguish them.) Notice that this time the first constraint includes the rank of a type variable: |a|. This is necessary because the ranks of type variables are not immediately known (they’re type variables and can be anything). You should think of |a| as just another rank variable that the system will have to solve for. To solve the above constraints, we have to find a mapping that assigns a rank to M₁, R₁ and |a|.

Now we have to check the outer application involving sum. We have to be a bit careful when constructing the requisite rank equality constraint between the parameters of sum and its argument length xss. If the rank variable M₁ above is assigned a non-zero value, then that means that the application length xss has implicit maps—namely M₁ of them. This means that (recalling that length : []a -> int) the type of the term length xss will not be int but instead be an M₁-dimensional int array because each implicit map will boost the rank of the application itself by 1. So, we have to tack on extra dimensions to the type of the application corresponding to each implicit map: length xss : []ᴹ¹ int, where []ᴹ¹ is just a sequence of [] of length M₁ (and where we understand length xss to have M₁ implicit maps). If we end up inferring that no maps are to be inserted, then M₁ = 0 and we’ll have length xss : []⁰ int which is equivalent to length xss : int, as expected.

As an example, if xss : [][]int then the explicitly elaborated application map length xss : []int is rank-correct and hence we assign the implicit application length xss the type []int.

So, when generating constraints for sum (length xss) we’re really generating constraints for sum (length xss : []ᴹ¹ int) because of the (potential) implicit maps in length xss. Recalling that sum : []int -> int), we have

M₂ + |[]| = R₂ + |[]ᴹ¹ int|
M₂ = 0 or R₂ = 0

which is equivalent to

M₂ + 1 = R₂ + M₁
M₂ = 0 or R₂ = 0

Notice that the rank equality constraint depends on M₁ (the map rank variable generated in the constraints for length xss). This is an explicit witness for the fact that how you insert maps at one application affects subsequent applications that depend on it.

Combining the constraints, the constraint set that expresses the rank requirements for sum (length xss) is

M₁ + 1 + |a| = R₁ + 2
M₁ = 0 or R₁ = 0
M₂ + 1 = R₂ + M₁
M₂ = 0 or R₂ = 0

What’s left is to find a minimal assignment for M₁, R₁, M₂, R₂, and |a| (and more generally for any constraint set generated in this way) such that the above constraints hold .

Solving the constraints

Our first order of business is figuring out how to get a minimal solution (that is, a way to enforce Rule 2—recall that Rule 1 is already enforced by the constraints themselves). Since this just requires that the number of maps and reps (which are represented by the M and R rank variables, respectively) is minimal, this is the same thing as minimizing the sum of the rank variables. For the example above, that corresponds to the following minimization problem:

  M₁ + R₁ + M₂ + R₂

subject to
  M₁ + 1 + |a| = R₁ + 2
  M₁ = 0 or R₁ = 0
  M₂ + 1 = R₂ + M₁
  M₂ = 0 or R₂ = 0

If you’re familiar with integer linear programs (ILPs), you may have noticed that this looks a lot like one except for the or-constraints enforcing Rule 1, which aren’t linear. Fortunately, we can linearize those constraints using some linear programming trickery: for each pair of rank variables Mᵢ and Rᵢ, we introduce two binary variables b_Mᵢ and b_Rᵢ that indicate whether the application is to be mapped or repped. The idea is to enforce that Mᵢ > 0 only if b_Mᵢ = 1 (and analogously for Rᵢ and b_Rᵢ). Then, by requiring that b_Mᵢ + b_Rᵢ ≤ 1, we then enforce that either Mᵢ = 0 or Rᵢ = 0. To establish the requisite relationship between Mᵢ and b_Mᵢ, we add the constraint

Mᵢ ≤ b_Mᵢ * BIG_NUMBER

where BIG_NUMBER is just a big number. When b_Mᵢ = 0, we have Mᵢ ≤ 0 forcing it to be 0. When b_Mᵢ = 1, we have Mᵢ ≤ BIG_NUMBER which effectively lets the solver assign Mᵢ to be any rank-satisfying value (as long as BIG_NUMBER is bigger than any rank we expect to see in a program—in practice ranks are small so it isn’t a big deal). We do something analogous for Rᵢ and b_Rᵢ to obtain the linearized minimization problem

  M₁ + R₁ + M₂ + R₂

subject to
  M₁ + 1 + |a| = R₁ + 2
  b_M₁ + b_R₁ ≤ 1
  M₁ ≤ b_M₁ * BIG_NUMBER
  R₁ ≤ b_R₁ * BIG_NUMBER
  M₂ + 1 = R₂ + M₁
  b_M₂ + b_R₂ ≤ 1
  M₂ ≤ b_M₂ * BIG_NUMBER
  R₂ ≤ b_R₂ * BIG_NUMBER

Now we have ourselves a real ILP! To solve it, we can just feed it into any existing solver (there are many, in our prototype we used GLPK) or implement one of the known algorithms for ILP solving. Technically, ILP is in NP (all known solvers have exponential runtime) but in practice solvers are very fast and the ILPs we generate are small, so even if you’re solving a bunch of these ILPs when type checking a large program it still makes for a very usable system. (Besides, normal type inference has exponential runtime anyway.)

Putting it all together

We now have all the ingredients we need to bestow Futhark with the power of rank polymorphism, a.k.a AUTOMAP. Here’s how the AUTOMAP system works from start-to-finish for each top-level definition in a program:

  1. For each application generate a rank equality constraint and a Rule 1 constraint and add it to the constraint set.

  2. Transform the constraint set into an ILP and solve it. If the ILP has no solutions or multiple solutions, fail and report an error to the user.

  3. The solution to the ILP tells the compiler how to insert the minimal number of maps and reps to make the program rank-correct. For example, if Mᵢ = 3 and Rᵢ = 0 for an application f x, then transform the application by inserting three maps and no reps: f x ⟶ map (map (map f)) x.

  4. Type check the transformed program with all implicit maps and reps made explicit using Futhark’s normal type checker and continue with compilation as usual.

The extra nice thing here is that we can do all this without affecting how the rest of the compiler works. This is a direct result of the fact that Futhark treats rank polymorphism as syntactic sugar and there’s no “magic” going on—all implicit constructs (map and rep) are just standard functions already available to the programmer. This also means that the programmer is free to use the AUTOMAP to whatever extent that they wish—from omitting all maps and reps (up to ambiguity), to only omitting some, to having a program with all maps and reps explicit.

That’s it! P.S. In case you’re wondering—we call the system AUTOMAP instead of AUTOMAPREP because implicit maps are way more common than implicit reps (and AUTOMAPREP isn’t as catchy).


  1. Of course you could be rigorous here and also add something along the lines of “define (x + y)_i = x_i + y_i” (where x_i is the ith component of x) to your notes so that the operation is well-defined. With this approach there are really two definitions of +—one for scalars and one for vectors and we’ve just grouped them under the same name. From a programming languages perspective, this corresponds to ad-hoc polymorphism (a.k.a function/operator overloading) wherein you group a bunch of functions (which expect different inputs) under the same name.↩︎

  2. In most languages with rank polymorphism, this isn’t a property of function applications but rather of functions themselves. Typed Remora is a nice example because it demonstrates this explicitly at the type-level: just like how we abstract over type parameters in parametric polymorphism (e.g., id : forall a. a -> a), Remora abstracts over shape parameters as well (“shape polymorphic” might be a better name than “rank polymorphic”). For example, sum : forall S. S int -> int (where S is a shape variable) might collapse any higher-dimensional array (say a matrix or a vector or a 36-dimensional tensor) into an integer. Unfortunately, this generality comes at a cost and makes type inference very, very challenging.↩︎

  3. You may be wondering why you couldn’t just type check function applications modulo rank mismatches and delay deciding how to map and rep until runtime. There are multiple reasons that is a bad idea aside from the usual dynamic rank polymorphism drawbacks: the ensuing lack of precise static type information would also weaken type checking/inference, make many optimizations intractable, and basically muck-up the existent compiler from end to end.↩︎