futhark-0.11.0: An optimising compiler for a functional, array-oriented language.

Futhark.Pass.ExtractKernels

Description

Kernel extraction.

In the following, I will use the term "width" to denote the amount of immediate parallelism in a map - that is, the outer size of the array(s) being used as input.

# Basic Idea

If we have:

  map
map(f)
bnds_a...
map(g)


Then we want to distribute to:

  map
map(f)
map
bnds_a
map
map(g)


But for now only if

1. it can be done without creating irregular arrays. Specifically, the size of the arrays created by map(f), by map(g) and whatever is created by bnds_a that is also used in map(g), must be invariant to the outermost loop.
2. the maps are _balanced_. That is, the functions f and g must do the same amount of work for every iteration.

The advantage is that the map-nests containing map(f) and map(g) can now be trivially flattened at no cost, thus exposing more parallelism. Note that the bnds_a map constitutes array expansion, which requires additional storage.

# Distributing Sequential Loops

As a starting point, sequential loops are treated like scalar expressions. That is, not distributed. However, sometimes it can be worthwhile to distribute if they contain a map:

  map
loop
map
map


If we distribute the loop and interchange the outer map into the loop, we get this:

  loop
map
map
map
map


Now more parallelism may be available.

# Unbalanced Maps

Unbalanced maps will as a rule be sequentialised, but sometimes, there is another way. Assume we find this:

  map
map(f)
map(g)
map


Presume that map(f) is unbalanced. By the simple rule above, we would then fully sequentialise it, resulting in this:

  map
loop
map
map


## Balancing by Loop Interchange

The above is not ideal, as we cannot flatten the map-loop nest, and we are thus limited in the amount of parallelism available.

But assume now that the width of map(g) is invariant to the outer loop. Then if possible, we can interchange map(f) and map(g), sequentialise map(f) and distribute, interchanging the outer parallel loop into the sequential loop:

  loop(f)
map
map(g)
map
map


After flattening the two nests we can obtain more parallelism.

When distributing a map, we also need to distribute everything that the map depends on - possibly as its own map. When distributing a set of scalar bindings, we will need to know which of the binding results are used afterwards. Hence, we will need to compute usage information.

# Redomap

Redomap can be handled much like map. Distributed loops are distributed as maps, with the parameters corresponding to the neutral elements added to their bodies. The remaining loop will remain a redomap. Example:

redomap(op,
fn (v) =>
map(f)
map(g),
e,a)


distributes to

let b = map(fn v =>
let acc = e
map(f),
a)
redomap(op,
fn (v,dist) =>
map(g),
e,a,b)


Note that there may be further kernel extraction opportunities inside the map(f). The downside of this approach is that the intermediate array (b above) must be written to main memory. An often better approach is to just turn the entire redomap into a single kernel.

Synopsis

# Documentation

Transform a program using SOACs to a program using explicit kernels, using the kernel extraction transformation.