Evolution of Futhark's Array Representation
As a functional array language with a focus on compiler optimisations, it is no surprise that much development has gone into Futhark’s representation of arrays. After recently spending some time on tweaking it yet again, I figured it might be interesting to write a few words about how it has changed over time. We also have a paper on the basic approach and a paper on a fancy optimisation built on top of it for those who prefer PDFs.
Array representations generally
Before getting to compiler representations, let us talk more generally about how to store arrays on a computer. For simplicity, all of the examples will use one- or two-dimensional arrays, but the ideas generalise to any rank.
Arrays are conceptually simple: they are sequences of values. If we want to get fancy, we can allow those elements to be arrays themselves, yielding multidimensional arrays, and if we want to get really fancy, we can allow each of those subarrays to be of a different size. But now we have gotten too fancy, so let us dial back the discussion to be about multidimensional regular arrays (sometimes also called rectangular), where all elements have the same size. In fact, we will only consider multidimensional arrays of primitive types, as arrays of records and other such compound types can be transformed away.
How do we represent such arrays at runtime in a computer? The simplest solution I know of is to store the shape of the array as a shape vector, one integer per dimension, as well as a pointer to somewhere in memory where all the actual array elements are stored.
There are many ways in which we can store the elements. The crucial question we need to answer is how we map a point in the (potentially) multidimensional index space of the array to the one-dimensional offset space of computer memory. The order in which we store the elements can have a drastic impact on performance due to spatial locality effects (e.g. cache efficiency). For example, consider storing the elements of the following two-dimensional array:
[[1,2,3],
 [4,5,6]]
If we use the common row-major order, then the elements are laid out in memory as
1 2 3 4 5 6
and we can transform a two-dimensional index (i,j) to a flat offset with the formula i*3+j.
Each row is contiguous in memory, meaning it is efficient to traverse
a row sequentially. However, if we traverse the first column, we see
that the values 1
and 4
are distant in memory (well, not so
distant for this small array). On the other hand, the column-major
order representation would use the ordering
1 4 2 5 3 6
and the indexing formula becomes j*2+i. Now columns are contiguous
in memory and can be traversed efficiently, but traversing a row
involves jumping with a larger stride. With a simplistic
representation, consisting of just a shape vector and a pointer, which
layout should we pick?
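To make the two layouts concrete, here is a small Haskell sketch (purely illustrative, not code from the Futhark compiler; all names are made up) that stores the example array above in both orders and computes flat offsets from two-dimensional indexes:

-- The example array [[1,2,3],[4,5,6]] stored as a flat list in both orders.
rowMajorElems, colMajorElems :: [Int]
rowMajorElems = [1, 2, 3, 4, 5, 6]
colMajorElems = [1, 4, 2, 5, 3, 6]

-- Offset of element (i,j) in a row-major array of shape [n][m]: i*m + j.
rowMajorOffset :: Int -> Int -> (Int, Int) -> Int
rowMajorOffset _n m (i, j) = i * m + j

-- Offset of element (i,j) in a column-major array of shape [n][m]: j*n + i.
colMajorOffset :: Int -> Int -> (Int, Int) -> Int
colMajorOffset n _m (i, j) = j * n + i

-- Both layouts agree on which element lives at (1,2), namely 6:
--   rowMajorElems !! rowMajorOffset 2 3 (1, 2) == 6
--   colMajorElems !! colMajorOffset 2 3 (1, 2) == 6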
It is important that array layout and array usage are a good match. Futhark does not provide direct control over memory layout to the programmer, but it is crucial that the compiler itself is able to transform array layouts as part of its optimisations. Further, different arrays might need different layouts. This means the compiler needs to have some kind of language, or notation, in which it can reason abstractly about array layouts, and annotate the program representation with how arrays should be laid out in memory at run-time.
Apart from performance concerns, the array representation also affects
which operations can be performed efficiently at run-time. For
example, in the representation outlined above, where the only metadata
is the shape of the array and we store the elements in row-major
order, it is cheap to interpret an array of 10 elements as a
two-dimensional array of shape [2][5]
(using Futhark shape
notation),
simply by changing the shape vector and not touching the elements.
This means that we could guarantee that a reshape operation is
essentially free in all cases. On the other hand, consider slicing a
column out of a two-dimensional array in order to obtain a
one-dimensional array. Since this array representation requires
contiguous storage of elements, and the elements of a column are not
contiguous, we have to allocate new space and copy over the elements.
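To illustrate this asymmetry, here is a small Haskell sketch (again illustrative only, with a made-up Arr type) of the shape-vector-plus-row-major-elements representation, where reshaping only touches the metadata while extracting a column must copy:

-- An array as a shape vector plus its elements in flat row-major order.
data Arr = Arr { shape :: [Int], elems :: [Int] } deriving Show

-- Reshaping only replaces the shape vector; the elements are untouched.
reshape :: [Int] -> Arr -> Arr
reshape newShape (Arr _ es) = Arr newShape es

-- Extracting column j from a two-dimensional [n][m] array must gather
-- non-contiguous elements, i.e. copy them into new storage.
column :: Int -> Arr -> Arr
column j (Arr [n, m] es) = Arr [n] [es !! (i * m + j) | i <- [0 .. n - 1]]
column _ _ = error "column: expected a two-dimensional array"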
It seems desirable to come up with an array representation that minimises how often a transformation requires us to copy array elements, but flexible representations have their own downsides. As we will see, the Futhark compiler actually moved from an initially very flexible representation to a rather simple one.
Index functions
We will stick to a model in which a q-dimensional array consists of a pointer p to somewhere in memory, along with an index function f
f : ℕ^q → ℕ
that maps an index in a q-dimensional index space (the domain of the function) to a flat offset relative to the pointer. That is, when indexing an array A[I] we compute the address p+f(I) and fetch an element from there (probably multiplying by an element size along the way). In some treatments, index functions are functions from indexes to values, but in Futhark we treat them as producing offsets - this is because we want to use them to reason about the physical storage of data, not just to model arrays theoretically.
We’ve already made an important restriction: arrays are associated with a single memory block, which means we cannot guarantee efficient (zero-copy) concatenation of arrays. This is to make memory management easier.
As mentioned above, the Futhark compiler needs the ability to transform and reason about index functions, in order to make decisions about array layout. This means we cannot allow index functions to be completely arbitrary functions.
In our original implementation, an index function started with a “base”, indicating the shape of an underlying row-major array:
RowMajor(n,m)
For any index function, we must be able to apply it to any point in its index space:
apply(RowMajor(n,m), (i,j)) = i * m + j
Note that the elements n,m of the index function, or the indexes i,j, do not need to be numbers - they can be symbolic variables instead, such that application of the index function produces an expression rather than a
number. We can then embed index functions directly into the
intermediate representation (IR) that the compiler uses to represent
Futhark programs. Whenever we bind an array (here a copy of an
existing array A
of type [n][m]i32
) to a name B
, the IR contains
the type of the array, the memory block in which it is stored
(skipping some detail here), and the index function used to access it:
let (B : [n][m]i32 @ B_mem -> RowMajor(n,m)) = copy A
When the compiler generates code for an array access B[i,j]
, it
symbolically applies the index function RowMajor(n,m)
to the index
(i,j)
, yielding the expression i * m + j
, which will then be
emitted as code. At runtime, the concrete values of i,m,j
will be
known, and the actual offset will be computed. This means that index
functions do not exist at runtime - they are entirely a compile-time
mechanism that the compiler uses to track array layouts.
Index space transformations of an array, such as slicing or indexing, are done by adding more operations to the index function, which essentially consists of an arbitrary-length sequence of transformations. For example, if F is an index function for a two-dimensional array, we can index the first dimension with
F->Index(i)
yielding a one-dimensional index function. We define application of the index function thus:
apply(F->Index(i), I) = apply(F, (i, I))
(Writing (i, I) for prefixing i to the tuple I.)
In the IR, we might use it like this:
let (C: [m]i32 @ B_mem -> RowMajor(n,m)->Index(i)) = B[i]
Note how the array C
is stored in the same memory block B_mem
as
the array B
- this directly expresses the fact that a copy is not
necessary. In fact, the expression B[i]
does not result in the
compiler generating any code at all - when C
is indexed at some
point, the compiler will consult the index function, which is
statically known, and emit the appropriate address calculations.
There is in principle no limit to which kinds of index space
transformations we can support in this way, as long as we can define
application for them, and the Futhark compiler used a robust set of
them - for example including things like Transpose
, which was how we
represented column-major arrays:
apply(F->Transpose, (i,j)) = apply(F, (j,i))
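As a rough sketch of what these structural index functions look like, here is how one might model them in Haskell, covering the base and transformations mentioned so far (using concrete integers where the compiler would use symbolic expressions; the type and function names are mine, not the compiler's):

-- An index function is a row-major base followed by zero or more
-- transformations (nested here, rather than written with ->).
data IxFun
  = RowMajor Int Int   -- a row-major array of shape n*m
  | Index IxFun Int    -- F->Index(i): fix the outermost dimension to i
  | Transpose IxFun    -- F->Transpose: swap the two dimensions

-- Application maps a multi-dimensional index to a flat offset,
-- mirroring the apply equations above.
apply :: IxFun -> [Int] -> Int
apply (RowMajor _n m) [i, j] = i * m + j
apply (Index f i)     is     = apply f (i : is)
apply (Transpose f)   [i, j] = apply f [j, i]
apply _               _      = error "apply: rank mismatch"

-- For example, RowMajor(2,3)->Index(1) applied to the index 2:
--   apply (Index (RowMajor 2 3) 1) [2] == 1 * 3 + 2 == 5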
In some cases, we have no choice how to assign an index function to an
array - the result of an index expression must certainly be an Index
on top of the underlying index function, as above, and a transposition
must of course be a Transpose
. But in those cases where we are
creating a fresh array (such as when doing a copy
, or for that
matter a map
), the compiler can invent whatever index function it
wishes.
Reconciling different index functions
Since index functions are treated as statically known symbolic functions by the compiler, questions arise about what happens in cases where the index function cannot be fully statically known. The compiler can always fall back to normalising index functions to be e.g. row-major, at the run-time cost of a copy, but naturally we don’t want to do that all the time.
Consider an array A
with index function RowMajor(n,n)
. As above,
an expression A[i]
produces an array with index function
RowMajor(n,n)->Index(i)
, and similarly A[j]
produces
RowMajor(n,n)->Index(j)
. If we have an expression
let B = if c then A[i] else A[j]
then what is the index function of B
? We can avoid a copy by
augmenting the if
to also return an integer that is used in the
index function of B
, as follows:
let (l: i64, B : [n]t @ B_mem -> RowMajor(n,n)->Index(l)) =
  if c then (i, A[i])
  else (j, A[j])
This is essentially piggybacking on the machinery that we already use to handle expressions that return arrays of unknown (“existential”) size (previously, previously).
Unfortunately, the only parts of the index function that we can
parameterise are the scalar “leaves” (here, the specific argument to
Index
). The structure itself cannot be parameterised. This
limitation shows up in cases such as
let B = if c then A else transpose A
The two branch cases produce index functions RowMajor(n,n)
and
RowMajor(n,n)->Transpose
respectively - how can we find a most
general
unifier
that can be instantiated with either? We could perhaps extend our
notion of parameterisation of index functions to allow something like
let (f: IxFun, B: f) = ...
where f
is then a run-time representation of an index function,
but this radically complicates the run-time behaviour of Futhark
programs, which we would rather avoid. Instead, our solution is that
whenever the index functions differ structurally, we insert a copy to
normalise:
let (B_mem: mem) (B: [n][n]t @ B_mem -> RowMajor(n,n)) =
  if c then copy A
  else copy (transpose A)
We can also parameterise over the memory block here to avoid having to copy in the first branch, but we'll focus on index functions for this post.
Incidentally, such a copy is also inserted when an array with a
non-RowMajor
index function is returned from an entry point, because
we require that all arrays passed in or out of Futhark are in
row-major order, as otherwise the external representation would need
to somehow encode the index function.
Recognising array layouts
Index functions only address the problem of computing the offset for a single element, but sometimes the compiler wishes to generate code for bulk operations, where taking a higher-level view can be more efficient.
Suppose we have an array with index function F1
, and we want to copy
its elements to form a new array with index function F2
. We assume
that the two index functions have the same index space, as otherwise
such a copy is nonsensical. One obvious way to implement the copy is
to simply iterate through all indexes I
and copy from offset F1(I)
to offset F2(I)
in the respective memory blocks. While this can be
done in parallel, it is most likely not maximally efficient. If the
two index functions represent the same element order, and they are
contiguous, then the copy can be done as a more efficient
memcpy
-like bulk copy, which might take advantage of
hardware-specific optimisations. Similarly, when copying from a
row-major to a column-major index function, it is much more
efficient to make use of a locality-optimised implementation of
transposition than it is to use a naive loop (even if parallel). These
are special cases, but they are rather common, and so it is crucial
that we are able to reliably detect them. This is where a very
flexible index function representation becomes awkward, as it can be
difficult to analyse long chains of transformations and recognise
known structures.
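For illustration, the naive copy described above might look something like this in Haskell (a sketch under the assumption that index functions are just functions from indexes to offsets; the name naiveCopy is made up, and the real compiler instead tries hard to recognise when a flat memcpy-style copy or an optimised transposition applies):

import Data.Array.IO (IOUArray, readArray, writeArray)

-- Copy every element from offset f1 ix in the source memory block to
-- offset f2 ix in the destination, one element at a time.
naiveCopy :: ([Int] -> Int)        -- index function of the source
          -> ([Int] -> Int)        -- index function of the destination
          -> [[Int]]               -- the shared index space
          -> IOUArray Int Double   -- source memory block
          -> IOUArray Int Double   -- destination memory block
          -> IO ()
naiveCopy f1 f2 ixSpace src dst =
  mapM_ (\ix -> readArray src (f1 ix) >>= writeArray dst (f2 ix)) ixSpace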
The Futhark compiler largely relied on pattern-matching index functions to detect these important special cases for copies, but this turned out to be unviable when we set out to implement more complicated access pattern analyses, such as those needed for Philip Munksgaard’s work on short circuiting. We needed a simpler and more uniform representation of index functions than the purely structural one.
Introducing LMADs
Linear Memory Address Descriptors (LMADs) were introduced by Paek, Hoeflinger, and Padua in their paper Efficient and Precise Array Access Analysis from 2002, where it was used for analysing the access patterns of imperative loops for the purpose of automatic parallelisation. Some years ago, Futhark co-creator Cosmin Oancea realised that LMADs might also be useful as a uniform representation of index functions, since they come with robust prior work on how to analyse them.
LMADs are quite simple. An LMAD for a q-dimensional array consists
of an integral offset o
and q pairs (n,s)
, where n
is the
non-negative size of the dimension and s
is the integral stride.
The pairs are conventionally written juxtaposed, as:
o + (n1,s1)(n2,s2)...(nq,sq)
To apply an LMAD to a q-dimensional index, we simply multiply each
index component with the corresponding stride, sum the resulting
terms, and add the offset. For example, an array of shape [2][3]
in
row-major order and zero offset would be written as the LMAD
0 + (2,3)(3,1)
and application to an index (i,j)
is done by computing
i*3 + j*1 + 0
An LMAD can represent many (but not all) of the typical index space
transformations. For instance, we can fix any given dimension
(“indexing”) by multiplying the index with the corresponding stride
and adding it to the offset. For example, if we have a two-dimensional
array A
with LMAD
o + (n1,s1)(n2,s2)
then the LMAD of the array A[i]
would be
(o + i*s1) + (n2,s2)
where the parentheses simply serve to clarify that the offset is
separate from the pairs. Transposition (or any permutation of
dimensions) is done simply by permuting its (size, stride) pairs. Negative strides are
useful for expressing reversals. The array reverse A
(where the
outermost dimension is reversed) would be given by this LMAD:
(o + (n1-1)*s1) + (n1,-s1)(n2,s2)
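Here is a small Haskell sketch of LMADs and the transformations just described, again using concrete integers where the compiler would use symbolic expressions (the names are mine, not the compiler's):

-- An LMAD: an offset plus one (size, stride) pair per dimension.
data LMAD = LMAD { offset :: Int, dims :: [(Int, Int)] } deriving Show

-- Application: multiply each index component with its stride, sum the
-- terms, and add the offset.
applyLMAD :: LMAD -> [Int] -> Int
applyLMAD (LMAD o ds) ix = o + sum (zipWith (*) ix (map snd ds))

-- Row-major and column-major LMADs for an array of shape [n][m].
rowMajorLMAD, colMajorLMAD :: Int -> Int -> LMAD
rowMajorLMAD n m = LMAD 0 [(n, m), (m, 1)]   -- e.g. 0 + (2,3)(3,1)
colMajorLMAD n m = LMAD 0 [(n, 1), (m, n)]

-- Fixing the outermost dimension to i ("indexing"): fold i*s1 into the offset.
indexLMAD :: LMAD -> Int -> LMAD
indexLMAD (LMAD o ((_, s) : ds)) i = LMAD (o + i * s) ds
indexLMAD _ _ = error "indexLMAD: zero-dimensional LMAD"

-- Transposing a two-dimensional LMAD: permute the pairs.
transposeLMAD :: LMAD -> LMAD
transposeLMAD (LMAD o [d1, d2]) = LMAD o [d2, d1]
transposeLMAD _ = error "transposeLMAD: expected two dimensions"

-- Reversing the outermost dimension: start at its last element, negate its stride.
reverseLMAD :: LMAD -> LMAD
reverseLMAD (LMAD o ((n, s) : ds)) = LMAD (o + (n - 1) * s) ((n, -s) : ds)
reverseLMAD _ = error "reverseLMAD: zero-dimensional LMAD"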
While LMADs do in principle allow zero strides, we do not provide any mechanism for creating them in Futhark, as they have unfortunate implications for in-place updates - generally, we do not allow dimensions to overlap, although this is in principle allowed by the LMAD representation.
LMADs have strong similarities to dope vectors (and for all I know might have been used under different names previously), but with the common difference that the stride of a dimension does not necessarily imply anything about the innermost dimensions. One attractive quality is that any LMAD for a q-dimensional array, no matter how complicated, can be represented with 2q+1 integers. This also makes it easy to parameterise them, because those integers are all there is to parameterise. For any q, the structure is completely fixed (and Futhark does not allow rank polymorphism so q is always known).
Using LMADs, the if
-and-transpose
example above can now be solved
as follows:
let (s1: i64, s2: i64, B_mem: mem) (B: [n][n]t @ B_mem -> 0 + (n,s1)(n,s2)) =
  if c then (n, 1, A)
  else (1, n, transpose A)
The two branches now return the appropriate strides that are then
referenced in the LMAD 0 + (n,s1)(n,s2)
. No copies! The copying
special cases can also be recognised fairly easily at run-time simply
by inspecting the concrete values of the LMADs, since they are
structurally similar.
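As one example of such run-time recognition, a check for a contiguous row-major element order amounts to comparing the strides against the suffix products of the sizes. Here is a sketch reusing the LMAD type from above (an illustration, not the compiler's actual check):

-- The strides of an LMAD describe a contiguous row-major element order
-- exactly when each stride is the product of the sizes of the dimensions
-- inner to it (the offset only shifts where that block of elements starts).
isRowMajor :: LMAD -> Bool
isRowMajor (LMAD _ ds) = map snd ds == tail (scanr (*) 1 (map fst ds))

-- isRowMajor (LMAD 0 [(2, 3), (3, 1)]) == True
-- isRowMajor (LMAD 0 [(2, 1), (3, 2)]) == False   (column-major)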
Multiple LMADs
Unfortunately, not all layouts can be expressed with LMADs. In
particular, the restriction to a uniform stride along each dimension
means some index space transformations are inexpressible. For example,
consider first transposing a matrix, then flattening it to a single
dimension: flatten (transpose A)
. If the original value of A
is
[[1,2,3],
[4,5,6]]
and if we suppose it is stored in memory in row-major order as
[1,2,3,4,5,6]
then semantically the transformed value should be
[1,4,2,5,3,6]
but there is no way to write a one-dimensional LMAD that produces the necessary offsets.
However, this can be handled if we use an index function representation that allows a composition of multiple LMADs.
The way we index a composition of two LMADs L1∘L2
is by first
applying the index to L2
, yielding a flat offset, unravelling that
offset into the index space of L1
, then applying L1
to that now
multi-dimensional index. In particular, L1
and L2
need not agree
on how many dimensions exist in their index space. The cost of the
unravelling is somewhat significant (a division, modulo, and
subtraction per dimension), although probably still less than the cost
of a memory access, except for very high-dimensional arrays.
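Continuing the LMAD sketch from before, application of such a composition could look like this (again just an illustration; the flat offset is unravelled here assuming a row-major index space over the outer LMAD's sizes):

-- Apply a composition L1∘L2: apply the index to the inner LMAD (L2),
-- unravel the resulting flat offset into the index space of the outer
-- LMAD (L1), and apply the outer LMAD to that index.
applyComposed :: LMAD -> LMAD -> [Int] -> Int
applyComposed outer inner ix =
  applyLMAD outer (unravel (map fst (dims outer)) (applyLMAD inner ix))

-- Unravel a flat offset into a multi-dimensional index for the given
-- shape: a division and a modulo per dimension.
unravel :: [Int] -> Int -> [Int]
unravel shp flat = snd (foldr step (flat, []) shp)
  where step n (rest, ix) = (rest `div` n, rest `mod` n : ix)

-- For the flatten (transpose A) example, with A of shape [2][3] stored
-- in row-major order:
--   outer = transposeLMAD (rowMajorLMAD 2 3)   -- 0 + (3,1)(2,3)
--   inner = LMAD 0 [(6, 1)]                    -- the flattened view
--   map (\k -> applyComposed outer inner [k]) [0..5] == [0,3,1,4,2,5]
-- which are exactly the offsets of [1,4,2,5,3,6] in memory.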
We used the composition-of-LMADs representation for a few years, as we
were reluctant to perform copies when flatten
(and unflatten
)
were combined with transpositions, but we eventually realised that we
still had many of the problems caused by our initial structural index
functions. In practice, the compiler would throw up its hands and give
up on analysing any index function that comprised more than a single
LMAD, and the compiler had lots of hacks and ad-hoc code to detect the
cases when a single LMAD would be sufficient. Eventually we decided
that a few extra copies was a small price to pay in return for
significantly simplifying the representation, and so we finally moved
to single-LMAD index
functions. Our journey
from complexity to simplicity had left a little more representational
scar tissue that I haven’t gotten into, which we
gradually removed over
a few further pull
requests, until we were finally left with an index function being a
single LMAD, with no additional information.
The end result was a significant simplification of the compiler - both conceptually and technically - and with no performance regressions outside of contrived test programs.
One reason we could get away with this is that index functions are not
the only way in which we optimise array accesses. If you write
something like let B = flatten (transpose A)
, then yes, that will
cause a copy if B is not optimised away entirely. There are many
other places in the compiler where operations on B can be fused,
simplified, inlined, or otherwise adjusted to directly operate on A
instead. However, when specifying worst-case behaviour in the cost
model we must of course not promise more than we can guarantee, so
reshaping is now in theory a copying operation - which is a bit
unusual among array languages.