Fork me on GitHub

# AUTOMAP: How to do NumPy-style broadcasting in Futhark (but better)

Posted on June 17, 2024 by Robert Schenck

Troels (the guy who usually writes the blog posts) asked me (Robert) to write about some recent work in Futhark that we call AUTOMAP, which allows you to do something akin to NumPy-style broadcasting in Futhark. I’m a PhD student here at DIKU and I’ve worked on Futhark one way or another for most of my PhD. (P.S. I’ll be graduating in December 2024 and I’m actively looking for potential postdoc positions in all things type systems/functional programming/compilers—please get in touch if you think I might be a match!)

AUTOMAP isn’t available in Futhark’s `master` branch quite yet (we’ve implemented a prototype) but we expect to merge it in the nearish future. We’ve also written a paper on AUTOMAP (currently in review); I’m going to give a talk based on it at ARRAY24 on June 25th.

When Troels first suggested adding something like broadcasting to Futhark, we thought it’d be a fun/simple feature and a good break from the heavy duty work on automatic differentiation we had just published. In short, we were quite wrong about that and there was a surprising amount of complexity involved, but we ended up with a really cool (and dare I say elegant) solution.

The motivation for the work was a desire to make mathematical code involving higher-dimensional arguments look closer to something you’d write by hand, without the clutter introduced by a bunch of “administrative” maps that are needed to lift scalar operations. For example, if you want to add the two vectors (represented by arrays) `[1,2,3]` and `[4,5,6]` together, you have to lift `+` to work over the arrays:

``map2 (+) [1,2,3] [4,5,6]``

But if you were writing some math notes by hand, you’d probably just write:

``[1,2,3] + [4,5,6]``

with the tacit understanding that what you really mean is the piecewise addition of the two vectors.1 In fact, piecewise addition is really the only reasonable thing to do here if the idea is to “make `+` work for these higher-dimensional arguments”—shouldn’t we be able to write `[1,2,3] + [4,5,6]` and have the compiler figure out that we really meant `map2 (+) [1,2,3] [4,5,6]`? This isn’t only a boon for the lazy programmer; omitting the “noise” of the map also makes things clearer and lets us use `+` in its proper infix location (sorry, Lisp programmers).

### A few examples

As a first step, let’s try to pin down what it is that we want the compiler to do by looking at a few more examples. How about subtracting one matrix from another:

``````[[1,2,3],   [[9,8,7],
[4,5,6], -  [6,5,4],
[7,8,9]]    [3,2,1]]``````

Now we have to lift `-` to work over two-dimensional arrays instead of one-dimensional, so we apply `map2` twice:

``````                [[1,2,3], [[9,8,7],
map2 (map2 (-))  [4,5,6],  [6,5,4],
[7,8,9]]  [3,2,1]]``````

For simplicity, we’re also going to assume `map` is an n-argument map that works for any number of arguments. So `map (+1) [1,2,3] = [2,3,4]`, `map (+) [1,2] [3,4] = [4,6]`, and so on. Hence the above can be expressed as just

``````              [[1,2,3], [[9,8,7],
map (map (-))  [4,5,6],  [6,5,4],
[7,8,9]]  [3,2,1]]``````

What happens if we add together two things of different ranks? (“Rank” is just the number of dimensions an array has; for example a matrix has rank 2 and an integer has rank 0.) Let’s try a matrix and a vector:

``````[[1,2,3],
[4,5,6],  + [10,20,30]
[7,8,9]]``````

Since a matrix (represented by a two-dimensional array) is involved, we know we we’ll have to lift twice:

``````              [[1,2,3],
map (map (+))  [4,5,6],  [10,20,30]
[7,8,9]]``````

But this isn’t enough because to add these two piecewise we need to be able to pair up each element of the matrix with an element of the vector. We’ll have to boost the dimensionality of the vector by replicating it to match the shape of the matrix:

``````              [[1,2,3],
map (map (+))  [4,5,6],  (replicate 3 [10,20,30])
[7,8,9]]``````

We actually don’t care about the specific number of times that we have to replicate the vector to make it match the matrix because because the compiler can figure it out later. So, to keep things simpler, we’ll use a “size-free” version of `replicate` that we call `rep : a -> []a`. It just tacks on an extra dimension by replicating its argument an unknown number of times (that will be determined later). Think of `rep` as `replicate n` where the correct `n` is chosen automatically. So, using `rep` instead of `replicate`, the above becomes

``````              [[1,2,3],
map (map (+))  [4,5,6],  (rep [10,20,30])
[7,8,9]]``````

In the above examples, we’re applying scalar operators to higher-ranked arguments and then massaging the application into something that makes sense/is rank-correct by inserting `map`s and `rep`s. This works for more than just scalar operators: any function application involving a function whose parameters have a rank mismatch with the arguments can be transformed into a rank-correct application in this way.

As a final more general example, let’s look at a function `f : []int -> [][]int -> int -> int` applied to argument an `xss : [][]int`:

``f xss xss xss``

The above application can be made rank-correct by inserting the following `map`s and `rep`s

``map ((map f xss) (rep xss)) xss``

(All functions in Futhark are curried; `f xss xss xss` is really `((f xss) xss) xss` and is hence actually three separate applications from the perspective of inserting `map`s and `rep`s.) This approach to making applications rank-correct is completely agnostic to infix/prefix application or any specific properties a function may have—an application can always be made rank-correct by insertion of `map`s and `rep`s. If the compiler can do said insertion automatically, then we can always apply functions to arguments of any rank. Programming languages that support function application like this are called rank polymorphic.2 Our task, then, is to endow Futhark with rank polymorphism by giving it the capability to automatically infer `map`s and `rep`s to make function applications rank-correct.

### (Static) rank polymorphism is hard

Most rank polymorphic languages are dynamic: they don’t do the requisite lifting and replicating until runtime, when the arguments to a function are fully known. NumPy and APL are probably the two most popular examples of dynamic rank polymorphic languages.

It’s generally easier to do rank polymorphism in dynamic programming systems like NumPy because you don’t have to figure out how exactly to make a function application rank-correct until runtime, when everything (function and arguments) is fully known. Of course, there are drawbacks including runtime overhead and a lack of transparency to the user (the only way to inspect how a function is lifted/replicated is to run the program).

In a statically-typed language like Futhark, things are harder. How to make a function application rank-correct (how to insert `map`s and `rep`s) has to be decided statically during type checking, without any runtime information whatsoever.3 This is made especially hard by the fact that Futhark supports parametric polymorphism (that is, you can write general functions with type variables like `id : a -> a` or `map : (a -> b) -> []a -> []b`)—a type variable can stand for any type and hence have any rank, which complicates reasoning about how to insert `map`s and `rep`s.

### Ambiguity makes things harder still

There are more challenges on the horizon: there are often many different ways to insert `map`s and `rep`s to make a function application rank-correct. A simple example is

``sum (length xss)``

where `sum : []int -> int`, `length : []a -> int`, and `xss : [][]int`. One way to make this application rank correct is by not modifying the `length xss` application at all and instead `rep` it when it’s passed to `sum` as an argument:

``sum (rep (length xss))``

But if you’re like me and lone `rep`s give you the ick, you’ll probably prefer

``sum (map length xss)``

where the inner application with `length` is lifted so that the outer application with `sum` doesn’t have to `rep` its argument. Or we might just go totally crazy and come up with this rank-correct concoction:

``map sum (map (map length) (rep xss))``

In fact, there are actually an infinite number of ways that we can insert `map`s and `rep`s to make the application rank-correct. What should we choose and how should we choose it? We’re going to need a strategy!

### The Strategy

Our first step in coming up with a system in which the compiler can make a reasonable choice on how to `map`/`rep` an application is laying some ground rules. There are two:

Rule 1: An application can be `map`ped or `rep`ped (or neither) but never both.

This means, given `f x`, we can write (as long as it’s rank-correct to do so) `map f x`, `map (map f) x`, or `f (rep (rep x))`, but writing `map f (rep x)` or any variation that includes a non-zero number of `map`s and `rep`s isn’t allowed.

This restriction exists because it’s never necessary to both `map` and `rep` in a single application since there are only two possible ways that there can be a rank mismatch: either the argument’s rank is higher than what the function expects (and hence a `map` (or multiple) is required) or the argument’s rank is less than what the function expects (and one or more `rep`s is required).

Note that the example

``````[[1,2,3],
[4,5,6],  + [10,20,30]
[7,8,9]]``````

which was elaborated into

``````              [[1,2,3],
map (map (+))  [4,5,6],  (rep [10,20,30])
[7,8,9]]``````

doesn’t violate Rule 1 because functions are curried—the above is actually two separate applications, one with `map`s and the other with a `rep`.

Rule 2: Minimize the number of inserted `map`s and `rep`s.

This rule says that we want the compiler to insert the fewest number of `map`s and `rep`s to make the program rank-correct. We found that this generally aligns with the programmer’s intent and that it also provides for a simple mental model for the programmer. For example, if `ysss : [][][]int` is a three-dimensional array, no `map`s or `rep`s will be inserted at all for the program `length ysss` even though `map length ysss` and `map (map length) ysss` are both legal elaborations.

This also helps us address ambiguity: the compiler no longer has to choose from an infinite number of solutions and instead only the minimal ones. Unfortunately, sometimes there are multiple minimal solutions and hence Rule 2 doesn’t entirely free us from the possibility of ambiguity.

### Even with The Strategy, ambiguities may exist

Sometimes Rule 1 and Rule 2 aren’t enough. Returning to our previous example involving `sum` and `length`, observe that

``sum (rep (length xss))``

and

``sum (map length xss)``

are both minimal because they have the same number of inserted operations (one each). When there are multiple minimal elaborations, we say that the program is ambiguous. In this case, we have to ask the programmer for some help by signaling an error. Because we’re doing everything statically, we know the two possibilities and the compiler can give the programmer a choice:

``````sum (length xss) is ambiguous. Possible elaborations:

(1)  sum (rep (length xss))
(2)  sum (map length xss)``````

We can conceive of an interactive system where the user just types `1` or `2` to make a choice and disambiguate (although having the compiler modify the programmer’s source code in this way probably isn’t a particularly good design). A better (and simpler) approach is a system where the user has to manually disambiguate. In practice, it’s actually pretty nice too! Why? Because `rep` and `map` are just normal Futhark functions and can always be inserted by the user to disambiguate. If the user wants option 2, they can just plop a `map` in the source code in front of `length` to disambiguate.

This is actually a pretty cool feature: rank polymorphism in Futhark is basically a kind of syntactic sugar that lets you leave `map`s and `rep`s implicit. You can always disambiguate when necessary by just inserting `map`s or `rep`s yourself (usually just one) to make the implicit `map` or `rep` explicit. It also means you can ask the compiler to elaborate all implicit `map`s and `rep`s so you can always see what’s going on. No mystery here!

### All applications must be considered simultaneously

Another thing to notice in the

``sum (map length xss)``

elaboration of `sum (length xss)` is that we had to consider the outer application with `sum` to know how to `map` the inner `length` application. In general, lifting one application can affect others that depend on it because `map`ping changes the rank of the application itself. The implication of this is quite significant: inserting `map`s or `rep`s to make an application rank-correct requires considering the entire top-level definition at hand (rather than just the local application).

This doesn’t mean it’s impossible to do rank polymorphism by only considering local applications one-by-one. If we had just considered the inner `length xss` application in isolation, Rule 2 would yield `length xss` without any `map`. Proceeding to the outer application then yields the other possible minimal elaboration:

``sum (rep (length xss))``

But this was just good luck; only considering applications one-by-one will not lead to a minimal solution in general. It will also often lead to “ugly” elaborations that the programmer probably didn’t intend. Minimality in the number of `map`s and `rep`s across a function’s body as a whole is a useful target because we’ve found it usually corresponds with what the programmer intends; non-minimality is often the product of needing extra `map`s and `rep`s to overcome the extra dimensions introduced by `map`s and `rep`s in earlier applications. In practice, programmers don’t write code like this and hence the subsequent elaboration by the compiler probably won’t correspond to their intent.

Now that we have our objective and strategy in hand, there’s only one thing left to do—find the rank-correct elaborations while heeding Rule 1 and Rule 2!

### Finding minimal rank-correct elaborations

The first step involves traversing the program and, for every application, generating constraints that encapsulate the requirements to obtain a rank-correct program. We (or, rather, the compiler) can then solve those constraints to figure out how to insert `map`s or `rep`s to make things rank-correct.

Consider the application of some function `f : a -> b` to an argument `x : c`:

``f x``

In a normal application, we require that the type of function’s parameter and the actual argument are equal, so we’d have the constraint

``a = c``

In our context, we’re only interested in rank equality (actual equality will be handled by the type checker later, after all `map`s and `rep`s have been inserted and the applications are rank-correct), so we relax the constraint to just enforcing rank equality, which we write like so:

``|a| = |c|``

where `|a|` means “the rank of `a`”. For example, `|[][]int| = 2` and `|int| = 0`.

Now we want to relax things further and allow for rank differences. If `c` has fewer dimensions than `a` (i.e. `|c| < |a|`) we’ll have to `rep` it. We don’t know the rank difference so we don’t know how much to replicate it by yet (how many `rep`s we need), so we’ll represent this unknown quantity with the rank variable `R`:

``|a| = R + |c|``

For example, if `f : [][]int -> int` and `x : []int` then we’d have

``|[][]int| = R + |[]int|``

which is equal to

``2 = R + 1``

We can clearly see that this constraint holds when `R = 1`; this means we need one `rep` to make the application rank-correct and `f (rep x)` would be a valid elaboration.

The other way `a` and `c` might have a rank mismatch is that `c` has more dimensions than `a` (i.e. `|c| > |a|`), in which case we’ll have to `map`. We again don’t know how many `map`s we’ll need, so we’ll represent it with the rank variable `M`:

``M + |a| = |c|``

Now, for example, if `f : int -> int` and `x : [][]int` then we’d have

``M + |int| =  |[][] int|``

which is equivalent to

``M = 2``

This means we have to add two `map`s to make the application rank-correct: `map (map f) x`.

Of course—since we don’t know anything about the rank difference in general—we don’t know if we need to `map` or `rep` so we need to include both `R` and `M` in our rank constraint:

``M + |a| = R + |c|``

We still need to enforce our first rule—that we either `map` or `rep` but not both. To do so, we just require that either `M = 0` or `R = 0` via an additional constraint. Hence, every function application `f x` generates two constraints:

``M + |a| = R + |c|``

and

``M = 0 or R = 0``

During type checking, we accumulate these constraints into a big list. For example, when checking `sum (length xss)`, we first check `length xss` (recalling that `length : []a -> int` and `xss : [][]int`). This yields the constraints

``````M₁ + 1 + |a| = R₁ + 2
M₁ = 0 or R₁ = 0``````

(Where we’ve added the subscript `1` to the rank variables, since we’ll have two sets of rank variables from the two applications and need to distinguish them.) Notice that this time the first constraint includes the rank of a type variable: `|a|`. This is necessary because the ranks of type variables are not immediately known (they’re type variables and can be anything). You should think of `|a|` as just another rank variable that the system will have to solve for. To solve the above constraints, we have to find a mapping that assigns a rank to `M₁`, `R₁` and `|a|`.

Now we have to check the outer application involving `sum`. We have to be a bit careful when constructing the requisite rank equality constraint between the parameters of `sum` and its argument `length xss`. If the rank variable `M₁` above is assigned a non-zero value, then that means that the application `length xss` has implicit maps—namely `M₁` of them. This means that (recalling that `length : []a -> int`) the type of the term `length xss` will not be `int` but instead be an `M₁`-dimensional `int` array because each implicit `map` will boost the rank of the application itself by 1. So, we have to tack on extra dimensions to the type of the application corresponding to each implicit `map`: `length xss : []ᴹ¹ int`, where `[]ᴹ¹` is just a sequence of `[]` of length `M₁` (and where we understand `length xss` to have `M₁` implicit maps). If we end up inferring that no `map`s are to be inserted, then `M₁ = 0` and we’ll have `length xss : []⁰ int` which is equivalent to `length xss : int`, as expected.

As an example, if `xss : [][]int` then the explicitly elaborated application `map length xss : []int` is rank-correct and hence we assign the implicit application `length xss` the type `[]int`.

So, when generating constraints for `sum (length xss)` we’re really generating constraints for `sum (length xss : []ᴹ¹ int)` because of the (potential) implicit maps in `length xss`. Recalling that `sum : []int -> int`), we have

``````M₂ + |[]| = R₂ + |[]ᴹ¹ int|
M₂ = 0 or R₂ = 0``````

which is equivalent to

``````M₂ + 1 = R₂ + M₁
M₂ = 0 or R₂ = 0``````

Notice that the rank equality constraint depends on `M₁` (the `map` rank variable generated in the constraints for `length xss`). This is an explicit witness for the fact that how you insert `map`s at one application affects subsequent applications that depend on it.

Combining the constraints, the constraint set that expresses the rank requirements for `sum (length xss)` is

``````M₁ + 1 + |a| = R₁ + 2
M₁ = 0 or R₁ = 0
M₂ + 1 = R₂ + M₁
M₂ = 0 or R₂ = 0``````

What’s left is to find a minimal assignment for `M₁`, `R₁`, `M₂`, `R₂`, and `|a|` (and more generally for any constraint set generated in this way) such that the above constraints hold .

### Solving the constraints

Our first order of business is figuring out how to get a minimal solution (that is, a way to enforce Rule 2—recall that Rule 1 is already enforced by the constraints themselves). Since this just requires that the number of `map`s and `rep`s (which are represented by the `M` and `R` rank variables, respectively) is minimal, this is the same thing as minimizing the sum of the rank variables. For the example above, that corresponds to the following minimization problem:

``````minimize
M₁ + R₁ + M₂ + R₂

subject to
M₁ + 1 + |a| = R₁ + 2
M₁ = 0 or R₁ = 0
M₂ + 1 = R₂ + M₁
M₂ = 0 or R₂ = 0``````

If you’re familiar with integer linear programs (ILPs), you may have noticed that this looks a lot like one except for the or-constraints enforcing Rule 1, which aren’t linear. Fortunately, we can linearize those constraints using some linear programming trickery: for each pair of rank variables `Mᵢ` and `Rᵢ`, we introduce two binary variables `b_Mᵢ` and `b_Rᵢ` that indicate whether the application is to be `map`ped or `rep`ped. The idea is to enforce that `Mᵢ > 0` only if `b_Mᵢ = 1` (and analogously for `Rᵢ` and `b_Rᵢ`). Then, by requiring that `b_Mᵢ + b_Rᵢ ≤ 1`, we then enforce that either `Mᵢ = 0` or `Rᵢ = 0`. To establish the requisite relationship between `Mᵢ` and `b_Mᵢ`, we add the constraint

``Mᵢ ≤ b_Mᵢ * BIG_NUMBER``

where `BIG_NUMBER` is just a big number. When `b_Mᵢ = 0`, we have `Mᵢ ≤ 0` forcing it to be 0. When `b_Mᵢ = 1`, we have `Mᵢ ≤ BIG_NUMBER` which effectively lets the solver assign `Mᵢ` to be any rank-satisfying value (as long as `BIG_NUMBER` is bigger than any rank we expect to see in a program—in practice ranks are small so it isn’t a big deal). We do something analogous for `Rᵢ` and `b_Rᵢ` to obtain the linearized minimization problem

``````minimize
M₁ + R₁ + M₂ + R₂

subject to
M₁ + 1 + |a| = R₁ + 2
b_M₁ + b_R₁ ≤ 1
M₁ ≤ b_M₁ * BIG_NUMBER
R₁ ≤ b_R₁ * BIG_NUMBER
M₂ + 1 = R₂ + M₁
b_M₂ + b_R₂ ≤ 1
M₂ ≤ b_M₂ * BIG_NUMBER
R₂ ≤ b_R₂ * BIG_NUMBER
``````

Now we have ourselves a real ILP! To solve it, we can just feed it into any existing solver (there are many, in our prototype we used GLPK) or implement one of the known algorithms for ILP solving. Technically, ILP is in NP (all known solvers have exponential runtime) but in practice solvers are very fast and the ILPs we generate are small, so even if you’re solving a bunch of these ILPs when type checking a large program it still makes for a very usable system. (Besides, normal type inference has exponential runtime anyway.)

### Putting it all together

We now have all the ingredients we need to bestow Futhark with the power of rank polymorphism, a.k.a AUTOMAP. Here’s how the AUTOMAP system works from start-to-finish for each top-level definition in a program:

1. For each application generate a rank equality constraint and a Rule 1 constraint and add it to the constraint set.

2. Transform the constraint set into an ILP and solve it. If the ILP has no solutions or multiple solutions, fail and report an error to the user.

3. The solution to the ILP tells the compiler how to insert the minimal number of `map`s and `rep`s to make the program rank-correct. For example, if `Mᵢ = 3` and `Rᵢ = 0` for an application `f x`, then transform the application by inserting three `map`s and no `rep`s: `f x ⟶ map (map (map f)) x`.

4. Type check the transformed program with all implicit `map`s and `rep`s made explicit using Futhark’s normal type checker and continue with compilation as usual.

The extra nice thing here is that we can do all this without affecting how the rest of the compiler works. This is a direct result of the fact that Futhark treats rank polymorphism as syntactic sugar and there’s no “magic” going on—all implicit constructs (`map` and `rep`) are just standard functions already available to the programmer. This also means that the programmer is free to use the AUTOMAP to whatever extent that they wish—from omitting all `map`s and `rep`s (up to ambiguity), to only omitting some, to having a program with all `map`s and `rep`s explicit.

That’s it! P.S. In case you’re wondering—we call the system AUTOMAP instead of AUTOMAPREP because implicit `map`s are way more common than implicit `rep`s (and AUTOMAPREP isn’t as catchy).

Footnotes

1. Of course you could be rigorous here and also add something along the lines of “define `(x + y)_i = x_i + y_i`” (where `x_i` is the `i`th component of `x`) to your notes so that the operation is well-defined. With this approach there are really two definitions of `+`—one for scalars and one for vectors and we’ve just grouped them under the same name. From a programming languages perspective, this corresponds to ad-hoc polymorphism (a.k.a function/operator overloading) wherein you group a bunch of functions (which expect different inputs) under the same name.↩︎

2. In most languages with rank polymorphism, this isn’t a property of function applications but rather of functions themselves. Typed Remora is a nice example because it demonstrates this explicitly at the type-level: just like how we abstract over type parameters in parametric polymorphism (e.g., `id : forall a. a -> a`), Remora abstracts over shape parameters as well (“shape polymorphic” might be a better name than “rank polymorphic”). For example, `sum : forall S. S int -> int` (where `S` is a shape variable) might collapse any higher-dimensional array (say a matrix or a vector or a 36-dimensional tensor) into an integer. Unfortunately, this generality comes at a cost and makes type inference very, very challenging.↩︎

3. You may be wondering why you couldn’t just type check function applications modulo rank mismatches and delay deciding how to `map` and `rep` until runtime. There are multiple reasons that is a bad idea aside from the usual dynamic rank polymorphism drawbacks: the ensuing lack of precise static type information would also weaken type checking/inference, make many optimizations intractable, and basically muck-up the existent compiler from end to end.↩︎