# AUTOMAP: How to do NumPy-style broadcasting in Futhark (but better)

Troels (the guy who usually writes the blog posts) asked me (Robert) to write about some recent work in Futhark that we call AUTOMAP, which allows you to do something akin to NumPy-style broadcasting in Futhark. I’m a PhD student here at DIKU and I’ve worked on Futhark one way or another for most of my PhD. (P.S. I’ll be graduating in December 2024 and I’m actively looking for potential postdoc positions in all things type systems/functional programming/compilers—please get in touch if you think I might be a match!)

AUTOMAP isn’t available in Futhark’s `master`

branch quite yet (we’ve
implemented a prototype)
but we expect to merge it in the nearish future. We’ve also written a paper on
AUTOMAP (currently in review); I’m going to give a talk based on it at
ARRAY24 on June 25th.

When Troels first suggested adding something like broadcasting to Futhark, we thought it’d be a fun/simple feature and a good break from the heavy duty work on automatic differentiation we had just published. In short, we were quite wrong about that and there was a surprising amount of complexity involved, but we ended up with a really cool (and dare I say elegant) solution.

The motivation for the work was a desire to make mathematical code involving
higher-dimensional arguments look closer to something you’d write by hand,
without the clutter introduced by a bunch of “administrative” maps that are
needed to lift scalar operations. For example, if you want to add the two
vectors (represented by arrays) `[1,2,3]`

and `[4,5,6]`

together, you have to
lift `+`

to work over the arrays:

`1,2,3] [4,5,6] map2 (+) [`

But if you were writing some math notes by hand, you’d probably just write:

`1,2,3] + [4,5,6] [`

with the tacit understanding that what you really mean is the piecewise addition
of the two vectors.^{1} In fact, piecewise addition is really the only
reasonable thing to do here if the idea is to “make `+`

work for these
higher-dimensional arguments”—shouldn’t we be able to write `[1,2,3] + [4,5,6]`

and have the compiler figure out that we really meant `map2 (+) [1,2,3] [4,5,6]`

? This isn’t only a boon for the lazy programmer; omitting the “noise”
of the map also makes things clearer and lets us use `+`

in its proper infix
location (sorry, Lisp programmers).

### A few examples

As a first step, let’s try to pin down what it is that we want the compiler to do by looking at a few more examples. How about subtracting one matrix from another:

```
1,2,3], [[9,8,7],
[[4,5,6], - [6,5,4],
[7,8,9]] [3,2,1]] [
```

Now we have to lift `-`

to work over *two-dimensional* arrays instead of
one-dimensional, so we apply `map2`

twice:

```
1,2,3], [[9,8,7],
[[4,5,6], [6,5,4],
map2 (map2 (-)) [7,8,9]] [3,2,1]] [
```

For simplicity, we’re also going to assume `map`

is an n-argument map that works
for any number of arguments. So `map (+1) [1,2,3] = [2,3,4]`

, `map (+) [1,2] [3,4] = [4,6]`

, and so on. Hence the above can be expressed as just

```
1,2,3], [[9,8,7],
[[4,5,6], [6,5,4],
map (map (-)) [7,8,9]] [3,2,1]] [
```

What happens if we add together two things of different ranks? (“Rank” is just the number of dimensions an array has; for example a matrix has rank 2 and an integer has rank 0.) Let’s try a matrix and a vector:

```
1,2,3],
[[4,5,6], + [10,20,30]
[7,8,9]] [
```

Since a matrix (represented by a two-dimensional array) is involved, we know we we’ll have to lift twice:

```
1,2,3],
[[4,5,6], [10,20,30]
map (map (+)) [7,8,9]] [
```

But this isn’t enough because to add these two piecewise we need to be able to pair up each element of the matrix with an element of the vector. We’ll have to boost the dimensionality of the vector by replicating it to match the shape of the matrix:

```
1,2,3],
[[4,5,6], (replicate 3 [10,20,30])
map (map (+)) [7,8,9]] [
```

We actually don’t care about the specific number of times that we have to
replicate the vector to make it match the matrix because because the compiler
can figure it
out
later. So, to keep things simpler, we’ll use a “size-free” version of
`replicate`

that we call `rep : a -> []a`

. It just tacks on an extra dimension
by replicating its argument an unknown number of times (that will be determined
later). Think of `rep`

as `replicate n`

where the correct `n`

is chosen
automatically. So, using `rep`

instead of `replicate`

, the above becomes

```
1,2,3],
[[4,5,6], (rep [10,20,30])
map (map (+)) [7,8,9]] [
```

In the above examples, we’re applying scalar operators to higher-ranked
arguments and then massaging the application into something that makes sense/is
*rank-correct* by inserting `map`

s and `rep`

s. This works for more than just
scalar operators: any function application involving a function whose parameters
have a rank mismatch with the arguments can be transformed into a rank-correct
application in this way.

As a final more general example, let’s look at a function `f : []int -> [][]int -> int -> int`

applied to argument an `xss : [][]int`

:

` f xss xss xss`

The above application can be made rank-correct by inserting the following
`map`

s and `rep`

s

` map ((map f xss) (rep xss)) xss`

(All functions in Futhark are curried;
`f xss xss xss`

is really `((f xss) xss) xss`

and is hence actually three
separate applications from the perspective of inserting `map`

s and `rep`

s.) This
approach to making applications rank-correct is completely agnostic to
infix/prefix application or any specific properties a function may have—an
application can always be made rank-correct by insertion of `map`

s and `rep`

s.
If the compiler can do said insertion automatically, then we can always apply
functions to arguments of any rank. Programming languages that support function
application like this are called *rank polymorphic*.^{2} Our task, then, is to
endow Futhark with rank polymorphism by giving it the capability to
automatically infer `map`

s and `rep`

s to make function applications
rank-correct.

### (Static) rank polymorphism is hard

Most rank polymorphic languages are *dynamic*: they don’t do the requisite
lifting and replicating until runtime, when the arguments to a function are
fully known. NumPy
and APL are probably
the two most popular examples of dynamic rank polymorphic languages.

It’s generally easier to do rank polymorphism in dynamic programming systems like NumPy because you don’t have to figure out how exactly to make a function application rank-correct until runtime, when everything (function and arguments) is fully known. Of course, there are drawbacks including runtime overhead and a lack of transparency to the user (the only way to inspect how a function is lifted/replicated is to run the program).

In a statically-typed language like Futhark, things are harder. How to make a
function application rank-correct (how to insert `map`

s and `rep`

s) has to be
decided statically during type checking, without any runtime information
whatsoever.^{3} This is made especially hard by the fact that Futhark supports
parametric polymorphism (that is, you can write general functions with type
variables like `id : a -> a`

or `map : (a -> b) -> []a -> []b`

)—a type
variable can stand for any type and hence have any rank, which complicates
reasoning about how to insert `map`

s and `rep`

s.

### Ambiguity makes things harder still

There are more challenges on the horizon: there
are often many different ways to insert `map`

s and `rep`

s to make
a function application rank-correct. A simple example is

` sum (length xss)`

where `sum : []int -> int`

, `length : []a -> int`

, and `xss : [][]int`

. One way
to make this application rank correct is by not modifying the `length xss`

application
at all and instead `rep`

it when it’s passed to `sum`

as an argument:

` sum (rep (length xss))`

But if you’re like me and lone `rep`

s give you the ick, you’ll probably prefer

` sum (map length xss)`

where the inner application with `length`

is lifted so that the outer
application with `sum`

doesn’t have to `rep`

its argument. Or we might just go
totally crazy and come up with this rank-correct concoction:

` map sum (map (map length) (rep xss))`

In fact, there are actually an *infinite* number of ways that we can insert
`map`

s and `rep`

s to make the application rank-correct. What should we choose
and how should we choose it? We’re going to need a strategy!

### The Strategy

Our first step in coming up with a system in which the compiler can make a
reasonable choice on how to `map`

/`rep`

an application is laying some ground
rules. There are two:

**Rule 1**: An application can be `map`

ped or `rep`

ped (or neither) but **never both**.

This means, given `f x`

, we can write (as long as it’s rank-correct to do so)
`map f x`

, `map (map f) x`

, or `f (rep (rep x))`

, but writing `map f (rep x)`

or
any variation that includes a non-zero number of `map`

s *and* `rep`

s isn’t
allowed.

This restriction exists because it’s never necessary to both `map`

and `rep`

in
a single application since there are only two possible ways that there can be a
rank mismatch: either the argument’s rank is higher than what the function
expects (and hence a `map`

(or multiple) is required) or the argument’s rank is
less than what the function expects (and one or more `rep`

s is required).

Note that the example

```
1,2,3],
[[4,5,6], + [10,20,30]
[7,8,9]] [
```

which was elaborated into

```
1,2,3],
[[4,5,6], (rep [10,20,30])
map (map (+)) [7,8,9]] [
```

doesn’t violate Rule 1 because functions are curried—the above is actually two
separate applications, one with `map`

s and the other with a `rep`

.

**Rule 2**: Minimize the number of inserted `map`

s and `rep`

s.

This rule says that we want the compiler to insert the fewest number of `map`

s
and `rep`

s to make the program rank-correct. We found that this generally aligns
with the programmer’s intent and that it also provides for a simple mental model
for the programmer. For example, if `ysss : [][][]int`

is a three-dimensional array,
no `map`

s or `rep`

s will be inserted at all for the program `length ysss`

even
though `map length ysss`

and `map (map length) ysss`

are both legal elaborations.

This also helps us address ambiguity: the compiler no longer has to choose from an infinite number of solutions and instead only the minimal ones. Unfortunately, sometimes there are multiple minimal solutions and hence Rule 2 doesn’t entirely free us from the possibility of ambiguity.

### Even with The Strategy, ambiguities may exist

Sometimes Rule 1 and Rule 2 aren’t enough. Returning to our previous example involving
`sum`

and `length`

, observe that

` sum (rep (length xss))`

and

` sum (map length xss)`

are both minimal because they have the same number of inserted operations (one
each). When there are multiple minimal elaborations, we say that the program is
*ambiguous*. In this case, we have to ask the programmer for some help by
signaling an error. Because we’re doing everything statically, we know the two
possibilities and the compiler can give the programmer a choice:

```
sum (length xss) is ambiguous. Possible elaborations:
1) sum (rep (length xss))
(2) sum (map length xss) (
```

We can conceive of an interactive system where the user just types `1`

or `2`

to
make a choice and disambiguate (although having the compiler modify the
programmer’s source code in this way probably isn’t a particularly good design).
A better (and simpler) approach is a system where the user has to manually
disambiguate. In practice, it’s actually pretty nice too! Why? Because `rep`

and
`map`

are just normal Futhark functions and can always be inserted by the user
to disambiguate. If the user wants option 2, they can just plop a `map`

in the
source code in front of `length`

to disambiguate.

This is actually a pretty cool feature: rank polymorphism in Futhark is
basically a kind of syntactic sugar that lets you leave `map`

s and `rep`

s
implicit. You can always disambiguate when necessary by just inserting `map`

s or
`rep`

s yourself (usually just one) to make the implicit `map`

or `rep`

explicit. It also means you can ask the compiler to elaborate all implicit
`map`

s and `rep`

s so you can always see what’s going on. No mystery here!

### All applications must be considered simultaneously

Another thing to notice in the

` sum (map length xss)`

elaboration of `sum (length xss)`

is that we had to consider the outer
application with `sum`

to know how to `map`

the inner `length`

application. In
general, lifting one application can affect others that depend on it because
`map`

ping changes the rank of the application itself. The implication of this is
quite significant: inserting `map`

s or `rep`

s to make an application
rank-correct requires considering the entire top-level definition at hand
(rather than just the local application).

This doesn’t mean it’s impossible to do rank polymorphism by only considering
local applications one-by-one. If we had just considered the inner `length xss`

application in isolation, Rule 2 would yield `length xss`

without any
`map`

. Proceeding to the outer application then yields the other possible
minimal elaboration:

` sum (rep (length xss))`

But this was just good luck; only considering applications one-by-one will **not**
lead to a minimal solution in general. It will also often lead to “ugly”
elaborations that the programmer probably didn’t intend. Minimality in the
number of `map`

s and `rep`

s across a function’s body as a whole is a useful
target because we’ve found it usually corresponds with what the programmer
intends; non-minimality is often the product of needing extra `map`

s and
`rep`

s to overcome the extra dimensions introduced by `map`

s and `rep`

s in
earlier applications. In practice, programmers don’t write code like this and
hence the subsequent elaboration by the compiler probably won’t correspond to
their intent.

Now that we have our objective and strategy in hand, there’s only one thing left to do—find the rank-correct elaborations while heeding Rule 1 and Rule 2!

### Finding minimal rank-correct elaborations

The first step involves traversing the program and, for every application,
generating *constraints* that encapsulate the requirements to obtain a
rank-correct program. We (or, rather, the compiler) can then solve those
constraints to figure out how to insert `map`

s or `rep`

s to make things
rank-correct.

Consider the application of some function `f : a -> b`

to an argument `x : c`

:

` f x`

In a normal application, we require that the type of function’s parameter and the actual argument are equal, so we’d have the constraint

` a = c`

In our context, we’re only interested in rank equality (actual equality will be
handled by the type checker later, after all `map`

s and `rep`

s have been
inserted and the applications are rank-correct), so we relax the constraint to
just enforcing rank equality, which we write like so:

` |a| = |c|`

where `|a|`

means “the rank of `a`

”. For example, `|[][]int| = 2`

and `|int| = 0`

.

Now we want to relax things further and allow for rank differences. If `c`

has
fewer dimensions than `a`

(i.e. `|c| < |a|`

) we’ll have to `rep`

it. We don’t
know the rank difference so we don’t know how much to replicate it by yet (how
many `rep`

s we need), so we’ll represent this unknown quantity with the *rank
variable* `R`

:

` |a| = R + |c|`

For example, if `f : [][]int -> int`

and `x : []int`

then we’d have

` |[][]int| = R + |[]int|`

which is equal to

`2 = R + 1`

We can clearly see that this constraint holds when `R = 1`

; this means we need
one `rep`

to make the application rank-correct and `f (rep x)`

would be a valid
elaboration.

The other way `a`

and `c`

might have a rank mismatch is that `c`

has more
dimensions than `a`

(i.e. `|c| > |a|`

), in which case we’ll have to `map`

. We
again don’t know how many `map`

s we’ll need, so we’ll represent it with the rank
variable `M`

:

` M + |a| = |c|`

Now, for example, if `f : int -> int`

and `x : [][]int`

then we’d have

` M + |int| = |[][] int|`

which is equivalent to

`2 M = `

This means we have to add two `map`

s to make the application rank-correct: `map (map f) x`

.

Of course—since we don’t know anything about the rank difference in
general—we don’t know if we need to `map`

or `rep`

so we need to include both
`R`

and `M`

in our rank constraint:

` M + |a| = R + |c|`

We still need to enforce our first rule—that we either `map`

or `rep`

**but
not both**. To do so, we just require that either `M = 0`

or `R = 0`

via an
additional constraint. Hence, every function application `f x`

generates two
constraints:

` M + |a| = R + |c|`

and

`0 or R = 0 M = `

During type checking, we accumulate these constraints into a big list. For
example, when checking `sum (length xss)`

, we first check `length xss`

(recalling that `length : []a -> int`

and `xss : [][]int`

). This yields the
constraints

```
1 + |a| = R₁ + 2
M₁ + 0 or R₁ = 0 M₁ =
```

(Where we’ve added the subscript `1`

to the rank variables, since we’ll have
two sets of rank variables from the two applications and need to distinguish
them.) Notice that this time the first constraint includes the rank of a type
variable: `|a|`

. This is necessary because the ranks of type variables are not
immediately known (they’re type variables and can be anything). You should think
of `|a|`

as just another rank variable that the system will have to solve
for. To solve the above constraints, we have to find a mapping that assigns a
rank to `M₁`

, `R₁`

and `|a|`

.

Now we have to check the outer application involving `sum`

.
We have to be a bit careful when constructing the requisite rank equality
constraint between the parameters of `sum`

and its argument `length xss`

. If
the rank variable `M₁`

above is assigned a non-zero value, then that means that
the application `length xss`

has implicit maps—namely `M₁`

of them. This
means that (recalling that `length : []a -> int`

) the type of the term `length xss`

will not be `int`

but instead be an `M₁`

-dimensional `int`

array because
each implicit `map`

will boost the rank of the application itself by 1. So, we
have to tack on extra dimensions to the type of the application corresponding to
each implicit `map`

: `length xss : []ᴹ¹ int`

, where `[]ᴹ¹`

is just a
sequence of `[]`

of length `M₁`

(and where we understand `length xss`

to have
`M₁`

implicit maps). If we end up inferring that no `map`

s are to be inserted,
then `M₁ = 0`

and we’ll have `length xss : []⁰ int`

which is equivalent to
`length xss : int`

, as expected.

As an example, if `xss : [][]int`

then the explicitly elaborated application
`map length xss : []int`

is rank-correct and hence we assign the *implicit*
application `length xss`

the type `[]int`

.

So, when generating constraints for `sum (length xss)`

we’re really generating
constraints for `sum (length xss : []ᴹ¹ int)`

because of the (potential)
implicit maps in `length xss`

. Recalling that `sum : []int -> int`

), we have

```
M₂ + |[]| = R₂ + |[]ᴹ¹ int|0 or R₂ = 0 M₂ =
```

which is equivalent to

```
1 = R₂ + M₁
M₂ + 0 or R₂ = 0 M₂ =
```

Notice that the rank equality constraint depends on `M₁`

(the `map`

rank
variable generated in the constraints for `length xss`

). This is an explicit
witness for the fact that how you insert `map`

s at one application affects
subsequent applications that depend on it.

Combining the constraints, the constraint set that expresses the rank
requirements for `sum (length xss)`

is

```
1 + |a| = R₁ + 2
M₁ + 0 or R₁ = 0
M₁ = 1 = R₂ + M₁
M₂ + 0 or R₂ = 0 M₂ =
```

What’s left is to find a minimal assignment for `M₁`

, `R₁`

, `M₂`

, `R₂`

, and
`|a|`

(and more generally for any constraint set generated in this way) such
that the above constraints hold .

### Solving the constraints

Our first order of business is figuring out how to get a minimal solution (that
is, a way to enforce Rule 2—recall that Rule 1 is already enforced by the
constraints themselves). Since this just requires that the number of `map`

s and
`rep`

s (which are represented by the `M`

and `R`

rank variables, respectively)
is minimal, this is the same thing as minimizing the sum of the rank
variables. For the example above, that corresponds to the following minimization
problem:

```
minimize
M₁ + R₁ + M₂ + R₂
subject to1 + |a| = R₁ + 2
M₁ + 0 or R₁ = 0
M₁ = 1 = R₂ + M₁
M₂ + 0 or R₂ = 0 M₂ =
```

If you’re familiar with integer linear programs
(ILPs), you may have noticed
that this looks a lot like one except for the or-constraints enforcing
Rule 1, which aren’t linear. Fortunately, we can linearize those constraints
using some linear programming trickery: for each pair of rank variables `Mᵢ`

and `Rᵢ`

, we introduce two binary variables `b_Mᵢ`

and `b_Rᵢ`

that indicate
whether the application is to be `map`

ped or `rep`

ped. The idea is to enforce
that `Mᵢ > 0`

only if `b_Mᵢ = 1`

(and analogously for `Rᵢ`

and
`b_Rᵢ`

). Then, by requiring that `b_Mᵢ + b_Rᵢ ≤ 1`

, we then enforce that
either `Mᵢ = 0`

or `Rᵢ = 0`

. To establish the requisite relationship between
`Mᵢ`

and `b_Mᵢ`

, we add the constraint

` Mᵢ ≤ b_Mᵢ * BIG_NUMBER`

where `BIG_NUMBER`

is just a big number. When `b_Mᵢ = 0`

, we have `Mᵢ ≤ 0`

forcing it to be 0. When `b_Mᵢ = 1`

, we have `Mᵢ ≤ BIG_NUMBER`

which
effectively lets the solver assign `Mᵢ`

to be any rank-satisfying value (as
long as `BIG_NUMBER`

is bigger than any rank we expect to see in a program—in
practice ranks are small so it isn’t a big deal). We do something analogous for
`Rᵢ`

and `b_Rᵢ`

to obtain the linearized minimization problem

```
minimize
M₁ + R₁ + M₂ + R₂
subject to1 + |a| = R₁ + 2
M₁ + 1
b_M₁ + b_R₁ ≤
M₁ ≤ b_M₁ * BIG_NUMBER
R₁ ≤ b_R₁ * BIG_NUMBER1 = R₂ + M₁
M₂ + 1
b_M₂ + b_R₂ ≤
M₂ ≤ b_M₂ * BIG_NUMBER
R₂ ≤ b_R₂ * BIG_NUMBER
```

Now we have ourselves a real ILP! To solve it, we can just feed it into any existing solver (there are many, in our prototype we used GLPK) or implement one of the known algorithms for ILP solving. Technically, ILP is in NP (all known solvers have exponential runtime) but in practice solvers are very fast and the ILPs we generate are small, so even if you’re solving a bunch of these ILPs when type checking a large program it still makes for a very usable system. (Besides, normal type inference has exponential runtime anyway.)

### Putting it all together

We now have all the ingredients we need to bestow Futhark with the power of rank polymorphism, a.k.a AUTOMAP. Here’s how the AUTOMAP system works from start-to-finish for each top-level definition in a program:

For each application generate a rank equality constraint and a Rule 1 constraint and add it to the constraint set.

Transform the constraint set into an ILP and solve it. If the ILP has no solutions or multiple solutions, fail and report an error to the user.

The solution to the ILP tells the compiler how to insert the minimal number of

`map`

s and`rep`

s to make the program rank-correct. For example, if`Mᵢ = 3`

and`Rᵢ = 0`

for an application`f x`

, then transform the application by inserting three`map`

s and no`rep`

s:`f x ⟶ map (map (map f)) x`

.Type check the transformed program with all implicit

`map`

s and`rep`

s made explicit using Futhark’s normal type checker and continue with compilation as usual.

The extra nice thing here is that we can do all this without affecting how the
rest of the compiler works. This is a direct result of the fact that Futhark
treats rank polymorphism as syntactic sugar and there’s no “magic” going
on—all implicit constructs (`map`

and `rep`

) are just standard functions
already available to the programmer. This also means that the programmer is free
to use the AUTOMAP to whatever extent that they wish—from omitting all `map`

s
and `rep`

s (up to ambiguity), to only omitting some, to having a program with all `map`

s
and `rep`

s explicit.

That’s it! P.S. In case you’re wondering—we call the system AUTOMAP instead of
AUTOMAPREP because implicit `map`

s are way more common than implicit `rep`

s (and
AUTOMAPREP isn’t as catchy).

**Footnotes**

Of course you could be rigorous here and also add something along the lines of “define

`(x + y)_i = x_i + y_i`

” (where`x_i`

is the`i`

th component of`x`

) to your notes so that the operation is well-defined. With this approach there are really two definitions of`+`

—one for scalars and one for vectors and we’ve just grouped them under the same name. From a programming languages perspective, this corresponds to ad-hoc polymorphism (a.k.a function/operator overloading) wherein you group a bunch of functions (which expect different inputs) under the same name.↩︎In most languages with rank polymorphism, this isn’t a property of function applications but rather of functions

*themselves*. Typed Remora is a nice example because it demonstrates this explicitly at the type-level: just like how we abstract over type parameters in parametric polymorphism (e.g.,`id : forall a. a -> a`

), Remora abstracts over*shape parameters*as well (“shape polymorphic” might be a better name than “rank polymorphic”). For example,`sum : forall S. S int -> int`

(where`S`

is a shape variable) might collapse any higher-dimensional array (say a matrix or a vector or a 36-dimensional tensor) into an integer. Unfortunately, this generality comes at a cost and makes type inference very, very challenging.↩︎You may be wondering why you couldn’t just type check function applications modulo rank mismatches and delay deciding how to

`map`

and`rep`

until runtime. There are multiple reasons that is a bad idea aside from the usual dynamic rank polymorphism drawbacks: the ensuing lack of precise static type information would also weaken type checking/inference, make many optimizations intractable, and basically muck-up the existent compiler from end to end.↩︎