Evolution of Futhark's Array Representation

Posted on March 6, 2024 by Troels Henriksen

As a functional array language with a focus on compiler optimisations, it is no surprise that much development has gone into Futhark’s representation of arrays. After recently spending some time on tweaking it yet again, I figured it might be interesting to write a few words about how it has changed over time. We also have a paper on the basic approach and a paper on a fancy optimisation built on top of it for those who prefer PDFs.

Array representations generally

Before getting to compiler representations, let us talk more generally about how to store arrays on a computer. For simplicity, all of the examples will use one- or two-dimensional arrays, but the ideas generalise to any rank.

Arrays are conceptually simple: they are sequences of values. If we want to get fancy, we can allow those elements to be arrays themselves, yielding multidimensional arrays, and if we want to get really fancy, we can allow each of those subarrays to be of a different size. But now we have gotten too fancy, so let us dial back the discussion to be about multidimensional regular arrays (sometimes also called rectangular), where all the elements of an array have the same size. In fact, we will only consider multidimensional arrays of primitive types, as arrays of records and other such compound types can be transformed away.

How do we represent such arrays at runtime in a computer? The simplest solution I know of is to store the shape of the array as a shape vector, one integer per dimension, as well as a pointer to somewhere in memory where all the actual array elements are stored.

There are many ways in which we can store the elements. The crucial question we need to answer is how we map a point in the (potentially) multidimensional index space of the array to the one-dimensional offset space of computer memory. The order in which we store the elements can have a drastic impact on performance due to spatial locality effects (e.g. cache efficiency). For example, consider storing the elements of the following two-dimensional array:

[[1,2,3],
 [4,5,6]]

If we use the common row-major order, then the elements are laid out in memory as

1 2 3 4 5 6

and we can transform a two-dimensional index (i,j) to a flat offset with the formula i*3+j.

Each row is contiguous in memory, meaning it is efficient to traverse a row sequentially. However, if we traverse the first column, we see that the values 1 and 4 are distant in memory (well, not so distant for this small array). On the other hand, the column-major order representation would use the ordering

1 4 2 5 3 6

and the indexing formula becomes j*2+i. Now columns are contiguous in memory and can be traversed efficiently, but traversing a row involves jumping with a larger stride. With a simplistic representation, consisting of just a shape vector and a pointer, which layout should we pick?
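To make the difference concrete, here is a small Python sketch (purely illustrative, not code from the Futhark compiler) computing the flat offset of element (i,j) under each of the two layouts:

def offset_row_major(i, j, rows, cols):
    # Rows are contiguous: incrementing j moves one element in memory.
    return i * cols + j

def offset_col_major(i, j, rows, cols):
    # Columns are contiguous: incrementing i moves one element in memory.
    return j * rows + i

# For the [[1,2,3],[4,5,6]] example (rows=2, cols=3):
# offset_row_major(1, 0, 2, 3) == 3   (element 4 in "1 2 3 4 5 6")
# offset_col_major(1, 0, 2, 3) == 1   (element 4 in "1 4 2 5 3 6")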

It is important that array layout and array usage are a good match. Futhark does not provide direct control over memory layout to the programmer, but it is crucial that the compiler itself is able to transform array layouts as part of its optimisations. Further, different arrays might need different layouts. This means the compiler needs to have some kind of language, or notation, in which it can reason abstractly about array layouts, and annotate the program representation with how arrays should be laid out in memory at run-time.

Apart from performance concerns, the array representation also affects which operations can be performed efficiently at run-time. For example, in the representation outlined above, where the only metadata is the shape of the array and we store the elements in row-major order, it is cheap to interpret an array of 10 elements as a two-dimensional array of shape [2][5] (using Futhark shape notation), simply by changing the shape vector and not touching the elements. This means that we could guarantee that a reshape operation is essentially free in all cases. On the other hand, consider slicing a column out of a two-dimensional array in order to obtain a one-dimensional array. Since this array representation requires contiguous storage of elements, and the elements of a column are not contiguous, we have to allocate new space and copy over the elements.
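As a toy illustration of this asymmetry (my own sketch, assuming the simple shape-vector-plus-row-major-buffer representation described above), reshaping only touches the metadata, while slicing out a column forces a copy:

from math import prod

def reshape(shape, elems, new_shape):
    # Free: only the shape vector changes; the element buffer is reused as-is.
    assert prod(shape) == prod(new_shape)
    return new_shape, elems

def slice_column(shape, elems, j):
    # Not free: column elements are not contiguous, so they must be copied
    # into a fresh buffer to restore the contiguous row-major invariant.
    rows, cols = shape
    return (rows,), [elems[i * cols + j] for i in range(rows)]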

It seems desirable to come up with an array representation that minimises how often a transformation requires us to copy array elements, but flexible representations have their own downsides. As we will see, the Futhark compiler actually moved from an initially very flexible representation to a rather simple one.

Index functions

We will stick to a model in which a q-dimensional array consists of a pointer p to somewhere in memory, along with an index function f

f : ℕ^q → ℕ

that maps an index in a q-dimensional index space (the domain of the function) to a flat offset relative to the pointer. That is, when indexing an array A[I] we compute the address p+f(I) and fetch an element from there (probably multiplying by an element size along the way). In some treatments, index functions are functions from indexes to values, but in Futhark we treat them as producing offsets - this is because we want to use them to reason about the physical storage of data, not just to model arrays theoretically.

We’ve already made an important restriction: arrays are associated with a single memory block, which means we cannot guarantee efficient (zero-copy) concatenation of arrays. This is to make memory management easier.

As mentioned above, the Futhark compiler needs the ability to transform and reason about index functions, in order to make decisions about array layout. This means we cannot allow index functions to be completely arbitrary functions.

In our original implementation, an index function started with a “base”, indicating the shape of an underlying row-major array:

RowMajor(n,m)

For any index function, we must be able to apply it to any point in its index space:

apply(RowMajor(n,m), (i,j)) = i * m + j

Note that the elements n,m of the index function, or the indexes i,j, do not need to be numbers - they can be symbolic variables instead, such that application produces an expression rather than a number. We can then embed index functions directly into the intermediate representation (IR) that the compiler uses to represent Futhark programs. Whenever we bind an array (here a copy of an existing array A of type [n][m]i32) to a name B, the IR contains the type of the array, the memory block in which it is stored (skipping some detail here), and the index function used to access it:

let (B : [n][m]i32 @ B_mem -> RowMajor(n,m)) = copy A

When the compiler generates code for an array access B[i,j], it symbolically applies the index function RowMajor(n,m) to the index (i,j), yielding the expression i * m + j, which will then be emitted as code. At runtime, the concrete values of i,m,j will be known, and the actual offset will be computed. This means that index functions do not exist at runtime - they are entirely a compile-time mechanism that the compiler uses to track array layouts.
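As a rough illustration of what such symbolic application might look like (a toy Python model of the idea, not the compiler's actual data types), we can represent sizes and indexes as strings and let application build an expression:

class RowMajor:
    def __init__(self, *shape):
        self.shape = shape  # e.g. ("n", "m"); entries may be variable names

    def apply(self, idx):
        # apply(RowMajor(n,m), (i,j)) = i * m + j, generalised to any rank.
        terms = []
        for k, i in enumerate(idx):
            inner = self.shape[k + 1:]
            terms.append(" * ".join(map(str, (i,) + inner)))
        return " + ".join(terms)

# RowMajor("n", "m").apply(("i", "j"))  ==>  "i * m + j"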

Index space transformations of an array, such as slicing or indexing, are done by adding more operations to the index function, which essentially consists of an arbitrary-length sequence of transformations. For example, if F is an index function for a two-dimensional array, we can index the first dimension with

F->Index(i)

yielding a one-dimensional index function. We define application of the index function thus:

apply(F->Index(i), I) = apply(F, (i, I))

(Writing (i, I) for prefixing i to the tuple I.)

In the IR, we might use it like this:

let (C: [m]i32 @ B_mem -> RowMajor(n,m)->Index(i)) = B[i]

Note how the array C is stored in the same memory block B_mem as the array B - this directly expresses the fact that a copy is not necessary. In fact, the expression B[i] does not result in the compiler generating any code at all - when C is indexed at some point, the compiler will consult the index function, which is statically known, and emit the appropriate address calculations.

There is in principle no limit to which kinds of index space transformations we can support in this way, as long as we can define application for them, and the Futhark compiler used a robust set of them - for example including things like Transpose, which was how we represented column-major arrays:

apply(F->Transpose, (i,j)) = apply(F, (j,i))
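Continuing the toy model above (reusing the RowMajor class from the previous sketch, again just an illustration), Index and Transpose can be layered on top of an underlying index function by defining their application in terms of it:

class Index:
    def __init__(self, base, i):
        self.base, self.i = base, i

    def apply(self, idx):
        # apply(F->Index(i), I) = apply(F, (i, I))
        return self.base.apply((self.i,) + tuple(idx))

class Transpose:
    def __init__(self, base):
        self.base = base

    def apply(self, idx):
        # apply(F->Transpose, (i,j)) = apply(F, (j,i))
        i, j = idx
        return self.base.apply((j, i))

# Index(RowMajor("n", "m"), "i").apply(("j",))     ==>  "i * m + j"
# Transpose(RowMajor("n", "m")).apply(("i", "j"))  ==>  "j * m + i"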

In some cases, we have no choice how to assign an index function to an array - the result of an index expression must certainly be an Index on top of the underlying index function, as above, and a transposition must of course be a Transpose. But in those cases where we are creating a fresh array (such as when doing a copy, or for that matter a map), the compiler can invent whatever index function it wishes.

Reconciling different index functions

Since index functions are treated as statically known symbolic functions by the compiler, questions arise about what happens in cases where the index function cannot be fully statically known. The compiler can always fall back to normalising index functions to be e.g. row-major, at the run-time cost of a copy, but naturally we don’t want to do that all the time.

Consider an array A with index function RowMajor(n,n). As above, an expression A[i] produces an array with index function RowMajor(n,n)->Index(i), and similarly A[j] produces RowMajor(n,n)->Index(j). If we have an expression

let B = if c then A[i] else A[j]

then what is the index function of B? We can avoid a copy by augmenting the if to also return an integer that is used in the index function of B, as follows:

let (l: i64, B : [n]t @ B_mem -> RowMajor(n,n)->Index(l)) =
  if c then (i, A[i])
       else (j, A[j])

This is essentially piggybacking on the machinery that we already use to handle expressions that return arrays of unknown (“existential”) size, as discussed in previous posts.

Unfortunately, the only parts of the index function that we can parameterise are the scalar “leaves” (here, the specific argument to Index). The structure itself cannot be parameterised. This limitation shows up in cases such as

let B = if c then A else transpose A

The two branch cases produce index functions RowMajor(n,n) and RowMajor(n,n)->Transpose respectively - how can we find a most general unifier that can be instantiated with either? We could perhaps extend our notion of parameterisation of index functions to allow something like

let (f: IxFun, B: f) = ...

where f is then a run-time representation of an index function, but this radically complicates the run-time behaviour of Futhark programs, which we would rather avoid. Instead, our solution is that whenever the index functions differ structurally, we insert a copy to normalise:

let (B_mem: mem) (B: [n][n]t @ B_mem -> RowMajor(n,n)) =
  if c then copy A
       else copy (transpose A)

We can also parameterise over the memory block here to avoid having to copy in the first branch, but we’ll focus on index functions in this post.

Incidentally, such a copy is also inserted when an array with a non-RowMajor index function is returned from an entry point, because we require that all arrays passed in or out of Futhark are in row-major order, as otherwise the external representation would need to somehow encode the index function.

Recognising array layouts

Index functions only address the problem of computing the offset for a single element, but sometimes the compiler wishes to generate code for bulk operations, where taking a higher-level view can be more efficient.

Suppose we have an array with index function F1, and we want to copy its elements to form a new array with index function F2. We assume that the two index functions have the same index space, as otherwise such a copy is nonsensical. One obvious way to implement the copy is to simply iterate through all indexes I and copy from offset F1(I) to offset F2(I) in the respective memory blocks. While this can be done in parallel, it is most likely not maximally efficient. If the two index functions represent the same element order, and they are contiguous, then the copy can be done as a more efficient memcpy-like bulk copy, which might take advantage of hardware-specific optimisations. Similarly, when copying from a row-major to a column-major index function, it is much more efficient to make use of a locality-optimised implementation of transposition than it is to use a naive loop (even if parallel). These are special cases, but they are rather common, and so it is crucial that we are able to reliably detect them. This is where a very flexible index function representation becomes awkward, as it can be difficult to analyse long chains of transformations and recognise known structures.
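To see why the structural representation makes this awkward, consider a toy copy-strategy chooser over the sketch classes from earlier (again purely illustrative): every special case is another pattern to match, and every new transformation multiplies the number of cases.

def copy_strategy(src, dst):
    # Bare RowMajor bases are contiguous and in row-major element order.
    if isinstance(src, RowMajor) and isinstance(dst, RowMajor):
        return "flat memcpy-style bulk copy"
    # A transposed row-major source copied into a row-major destination.
    if isinstance(src, Transpose) and isinstance(src.base, RowMajor) \
            and isinstance(dst, RowMajor):
        return "locality-optimised transposition"
    # Longer chains of transformations fall through to the slow path.
    return "generic element-by-element copy"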

The Futhark compiler largely relied on pattern-matching index functions to detect these important special cases for copies, but this turned out to be unviable when we set out to implement more complicated access pattern analyses, such as those needed for Philip Munksgaard’s work on short circuiting. We needed a simpler and more uniform representation of index functions than the purely structural one.

Introducing LMADs

Linear Memory Address Descriptors (LMADs) were introduced by Paek, Hoeflinger, and Padua in their paper Efficient and Precise Array Access Analysis from 2002, where they were used for analysing the access patterns of imperative loops for the purpose of automatic parallelisation. Some years ago, Futhark co-creator Cosmin Oancea realised that LMADs might also be useful as a uniform representation of index functions, since they come with robust prior work on how to analyse them.

LMADs are quite simple. An LMAD for a q-dimensional array consists of an integral offset o and q pairs (n,s), where n is the non-negative size of the dimension and s is the integral stride. The pairs are conventionally written juxtaposed, as:

o + (n1,s1)(n2,s2)...(nq,sq)

To apply an LMAD to a q-dimensional index, we simply multiply each index component with the corresponding stride, sum the resulting terms, and add the offset. For example, an array of shape [2][3] in row-major order and with zero offset would be written as the LMAD

0 + (2,3)(3,1)

and application to an index (i,j) is done by computing

i*3 + j*1 + 0
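In code, LMAD application is essentially a one-liner (a minimal sketch; the compiler of course does this symbolically):

def lmad_apply(offset, dims, idx):
    # dims is a list of (size, stride) pairs, one per dimension.
    return offset + sum(i * s for i, (_, s) in zip(idx, dims))

# The 0 + (2,3)(3,1) example applied to (i,j) = (1,2):
# lmad_apply(0, [(2, 3), (3, 1)], (1, 2)) == 1*3 + 2*1 + 0 == 5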

An LMAD can represent many (but not all) of the typical index space transformations. For instance, we can fix any given dimension (“indexing”) by multiplying the index with the corresponding stride and adding it to the offset. For example, if we have a two-dimensional array A with LMAD

o + (n1,s1)(n2,s2)

then the LMAD of the array A[i] would be

(o + i*s1) + (n2,s2)

where the parentheses simply serve to clarify that the offset is separate from the pairs. Transposition (or any permutation of dimensions) is done simply by permuting the LMAD. Negative strides are useful for expressing reversals. The array reverse A (where the outermost dimension is reversed) would be given by this LMAD:

(o + (n1-1)*s1) + (n1,-s1)(n2,s2)
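These transformations are easy to state as functions from LMADs to LMADs. Here is a sketch following the formulas above (with offsets and strides as plain integers, though in the compiler they are symbolic):

def lmad_index(offset, dims, i):
    # Fix the outermost dimension at i: fold i*s1 into the offset.
    (_, s1), rest = dims[0], dims[1:]
    return offset + i * s1, rest

def lmad_permute(offset, dims, perm):
    # Transposition (or any permutation of dimensions) permutes the pairs.
    return offset, [dims[p] for p in perm]

def lmad_reverse(offset, dims):
    # Reverse the outermost dimension: negate its stride, adjust the offset.
    (n1, s1), rest = dims[0], dims[1:]
    return offset + (n1 - 1) * s1, [(n1, -s1)] + rest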

While LMADs do in principle allow zero strides, we do not provide any mechanism for creating them in Futhark, as they have unfortunate implications for in-place updates - generally, we do not allow dimensions to overlap, even though the LMAD representation itself permits it.

LMADs have strong similarities to dope vectors (and for all I know might have been used under a different name previously), but with the difference that the stride of a dimension does not necessarily imply anything about the inner dimensions. One attractive quality is that any LMAD for a q-dimensional array, no matter how complicated, can be represented with 2q+1 integers. This also makes it easy to parameterise them, because those integers are all there is to parameterise. For any q, the structure is completely fixed (and Futhark does not allow rank polymorphism, so q is always known).

Using LMADs, the if-and-transpose example above can now be solved as follows:

let (s1: i64, s2: i64, B_mem: mem) (B: [n][n]t @ B_mem -> 0 + (n,s1)(n,s2)) =
  if c then (n, 1, A)
       else (1, n, transpose A)

The two branches now return the appropriate strides that are then referenced in the LMAD 0 + (n,s1)(n,s2). No copies! The copying special cases can also be recognised fairly easily at run-time simply by inspecting the concrete values of the LMADs, since they are structurally similar.

Multiple LMADs

Unfortunately, not all layouts can be expressed with LMADs. In particular, the restriction to a uniform stride along each dimension means some index space transformations are inexpressible. For example, consider first transposing a matrix, then flattening it to a single dimension: flatten (transpose A). If the original value of A is

[[1,2,3],
 [4,5,6]]

and if we suppose it is stored in memory in row-major order as

[1,2,3,4,5,6]

then semantically the transformed value should be

[1,4,2,5,3,6]

but there is no way to write a one-dimensional LMAD that produces the necessary offsets.

However, this can be handled if we use an index function representation that allows a composition of multiple LMADs.

The way we index a composition of two LMADs L1∘L2 is by first applying the index to L2, yielding a flat offset, unravelling that offset into the index space of L1, then applying L1 to that now multi-dimensional index. In particular, L1 and L2 need not agree on how many dimensions exist in their index space. The cost of the unravelling is somewhat significant (a division, modulo, and subtraction per dimension), although probably still less than the cost of a memory access, except for very high-dimensional arrays.
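Here is a sketch of that composition, assuming the flat offset produced by L2 is unravelled row-major-style over the sizes of L1 (my own illustration of the scheme, not the compiler's exact formulation):

def lmad_apply(offset, dims, idx):  # as in the earlier sketch
    return offset + sum(i * s for i, (_, s) in zip(idx, dims))

def unravel(flat, sizes):
    # One modulo and division per dimension of L1's index space.
    idx = []
    for n in reversed(sizes):
        idx.append(flat % n)
        flat //= n
    return tuple(reversed(idx))

def compose_apply(l1, l2, idx):
    (o1, dims1), (o2, dims2) = l1, l2
    flat = lmad_apply(o2, dims2, idx)
    return lmad_apply(o1, dims1, unravel(flat, [n for n, _ in dims1]))

# flatten (transpose A) for the [2][3] row-major array:
#   l1 = (0, [(3, 1), (2, 3)])   # LMAD of transpose A
#   l2 = (0, [(6, 1)])           # flat 1D view of its 6 elements
# [compose_apply(l1, l2, (k,)) for k in range(6)] == [0, 3, 1, 4, 2, 5]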

We used the composition-of-LMADs representation for a few years, as we were reluctant to perform copies when flatten (and unflatten) were combined with transpositions, but we eventually realised that we still had many of the problems caused by our initial structural index functions. In practice, the compiler would throw up its hands and give up on analysing any index function that comprised more than a single LMAD, and the compiler had lots of hacks and ad-hoc code to detect the cases where a single LMAD would be sufficient. Eventually we decided that a few extra copies were a small price to pay in return for significantly simplifying the representation, and so we finally moved to single-LMAD index functions. Our journey from complexity to simplicity had left behind a bit more representational scar tissue that I haven’t gone into here, which we gradually removed over a few further pull requests, until we were finally left with an index function being just a single LMAD, with no additional information.

The end result was a significant simplification of the compiler - both conceptually and technically - and with no performance regressions outside of contrived test programs.

One reason we could get away with this is that index functions are not the only way in which we optimise array accesses. If you write something like let B = flatten (transpose A), then yes, that will cause a copy if B is not optimised away entirely. There are many other places in the compiler where operations on B can be fused, simplified, inlined, or otherwise adjusted to directly operate on A instead. Although when specifying worst-case behaviour in the cost model, we must of course not promise more than we can guarantee, so reshaping is now in theory a copying operation - which is a bit unusual among array languages.