Futhark 0.7.1 released, now with C# backend and efficient histograms on GPU
Futhark 0.7.1 has just been released—full changelog at the usual place. This release was primarily motivated by ICFP 2018 which takes place next week, and where two members of the Futhark team—Martin Elsman and myself—will be giving presentations. Functionalitywise, this is another big release, as this post will hopefully show.
There is a major breaking change in that we deleted the vast majority of the basis library, and instead migrated it to external packages that are accessed via the package manager we introduced in 0.6.3 (more here and here). The motivation behind this change is to allow more active evolution of libraries without tying them to the release schedule of the compiler, and to permit library experimentation without requiring specific compiler versions. The only library code that remains bundled with the compiler is the implicitly imported default prelude, which mostly wraps compiler intrinsics and defines extremely simple utility functions (such as head
or take
) that can only be reasonably implemented in one way.
Apart from breaking all programs that depend on the basis library, the 0.7.1 release brings three major features. One of them is the new interpreter/debugger, and the two others are discussed below.
C# Backend
The Futhark compiler aggressively offloads computation to the GPU via the OpenCL API. OpenCL works by having an ordinary program running on the CPU that transmits code and data to the GPU (or any other accelerator, but we’ll stick to GPUs). In the ideal case, the CPUcode is mostly glue that performs bookkeeping and making API calls. No matter the language the CPU code is written in, the GPU code will be written in OpenCL C and translated at program initialisation to whatever machine code is needed by the concrete GPU.
Since the early days, we exploited this property to provide a PyOpenCL backend that makes it very convenient to call Futhark code from Python. Thanks to the work of Mikkel Storgaard Knudsen—a nowgraduated masters student at DIKU—we not also have a similar C#/OpenCL backend. While the PyOpenCL backend generates code that is in some cases a few factors slower than our C/OpenCL backend, the C#/OpenCL backend generates code with barely any performance difference. Full details are in Mikkel’s thesis (which also discusses the implementation of a compiler from a subset of F# to Futhark).
The main limitation right now is how arrays are passed from C# to Futhark and back again. This is done by copying them to the CPU in the form of a flattened onedimensional array, along with a separate array containing the dimensions of the original multidimensional array. Not exactly a pleasant or efficient calling convention. We’ll come up with a more sophisticated design once we get some more experience with how the C# interop can be most useful. The C and PyOpenCL backends do not suffer from this problem, so we know how to solve it on a technical level.
Histogram Computations
Futhark is fundamentally a lambda calculusderived language, where definition and application of functions are the building blocks on top of which everything else is built. However, Futhark’s parallel prowess comes from a small collection of builtin functions, called SOACs, that are not expressed in terms of more primitive concepts, but are instead handled directly by the compiler. These SOACs require significant implementation effort (every part of the compiler must know about them) and we therefore try to limit their number. Until recently, they were limited to map
, scan
, reduce
, and scatter
(and some streaming variants), which can handle most problems efficiently (often by using them to build more powerful abstractions). However, sometimes we come across problems that cannot be efficiently handled with any combination of the existing SOACs. If these problems occur sufficiently frequently, or are instances of an underlying general pattern, we may be inspired to add a new SOAC. This occured with a parallel pattern sometimes called a generalised reduction in the compiler literature, but that most people will recognise as a histogram.
A histogram computation takes as input a collection of elements, maps each to one of k bins, and counts the number of elements that fall into each bin (discarding invalid bins). In imperative pseudocode, we might write it as:
for x in xs:
bin = to_bin(x)
if bin >= 0 && bin < k:
bins[bin] = bins[bin] + 1
This is the most common form of a histogram. However, we can generalise it to use any arbitrary function f to update the chosen bin:
for x in xs:
bin = to_bin(x)
if bin >= 0 && bin < k:
bins[bin] = f(bins[bin], x)
For example, we might pick f to be a max
function, which will find the maximum element for every bin. If the function f is commutative (f(x,y) = f(y,x)) and associative (f(x,f(y,z)) = f(f(x,y), z)), then all iterations of the loop can in principle be evaluated in parallel (as long as we are careful to update the bin atomically). Viewed this way, a histogram computation performs k reductions in parallel, which is the source of the term generalised reduction.
Histograms in Futhark
So how did we previously express the above in Futhark? For simplicity, let us stick to the original formulation where we are simply counting occurrences of integers. We can write it as a sequential loop with inplace updates secured by Futhark’s uniqueness types:
let histogram_seq [n] (k: i32) (is: [n]i32): [k]i32 =
loop h = replicate k 0 for i in is do
if i >= 0 && i < k then h with [i] < h[i] + 1
else h
Unfortunately, there is no parallelism here, and the compiler will not let us perform updates on an externally defined array within a parallel loop, as it is not safe in general.
One option is performing a sequential loop over all the bins, and count for each bin the number of elements that go into it:
let histogram_loop [n] (k: i32) (is: [n]i32): [k]i32 =
let bucket i = is > map ((==i) >> i32.bool) > i32.sum
in map bucket (0..<k)
This is very efficient for small k as this kind of segmented reduction is handled efficiently by the Futhark compiler, but it is asymptotically inefficient: O(nk) instead of O(n).
Futhark contains some unusual streaming combiners that can be used to assign a sequential operation to some number of thread (the exact count is left up to the compiler and runtime), and combine the perthread results into a single result. We can use this to implement the histogram as follows:
let histogram [n] (k: i32) (is: [n]i32): [k]i32 =
stream_red_per (map2 (+)) (histogram_seq k) is
Since each thread produces a histogram (a vector), we use vector addition (map2 (+)
) to combine the perthread results. Unfortunately, this implementation requires the creation of a sizek histogram per thread, which is fine for low k, but disastrous for larger k, since a GPU will typically require tens of thousands of threads in order to be saturated. For all but small k, we will pay a huge premium in memory.
A scalable solution turns out to be rather complicated to write, and not all that performant in practice. The following implementation first sorts the elements by bin with a radix sort (from github.com/dikudk/sorts), then uses an irregular segmented reduction (from github.com/dikudk/segmented) to compute the number of elements in each bin:
let histogram_reduction [n] (k: i32) (is: [n]i32) =
let num_bits = t32 (f32.ceil (log2 (r32 k)))
let is' = radix_sort num_bits i32.get_bit is
let flags = map2 (!=) is' (rotate (1) is')
in segmented_reduce (+) 0 flags (replicate n 1)
Actually, this implementation is not entirely correct, as it assumes that no bin is empty. Note that it does not use k at all. Another problem is that the radix sort is quite an expensive operation (and the segmented reduction is not particularly cheap either). Clearly something was missing from Futhark.
The solution was a new SOAC, reduce_by_index
, which was designed and implemented by yet another DIKU student, Sune Hellfritzsch. With this, we can implement the histogram as follows:
let histogram_reduce_by_index [n] (k: i32) (is: [n]i32): [k]i32 =
let bins = replicate k 0
in reduce_by_index bins (+) 0 is (replicate n 1)
The arguments to reduce_by_index
are, in order, the initial contents of the bins, the bincombining operator (f), a neutral element for the operator, the bin assignments for each element, and finally the elements themselves.
The compiler uses some clever tricks to handle reduce_by_index
. Mostly, it uses lowlevel atomic functions to efficiently update the bins with a minimum of synchronisation. It can even exploit that certain common operators (like 32bit integer addition) are supported directly by the hardware (atomic_add()
in OpenCL). However, reduce_by_index
can handle any operator, by falling back to a CASbased locking mechanism. While not nearly as efficient as when direct hardware support is available, it at least scales decently. Furthermore, for small k, the implementation of reduce_by_index
uses multiple histograms that are updated independently and only combined at the end, in order to minimise expensive conflicts where multiple threads try to update the same bin at the same time.
Scalability is good. The following table lists various values of k and n and the time it takes to compute the corresponding histogram on a Vega 64 GPU:
k  n  Time in μs  k  n  Time in μs 


2²⁰ 


2²⁰ 


2²² 


2²² 


2²⁴ 


2²⁴ 


2²⁰ 


2²⁰ 


2²² 


2²² 


2²⁴ 


2²⁴ 

128  2²⁰ 


2²⁰ 

128  2²² 


2²² 

128  2²⁴ 


2²⁴ 

Note how the runtime is (almost) invariant of k. For this benchmark, I compute the bin assignments as map (%k) (iota n)
, which provides perfectly regular accesses. In practice, scaling will likely not be as smooth, as high k likely implies more irregular memory accesses. As a comparison, a straightforward sequential C implementation takes 112,374μs for k=8, n=2²⁴ on my Ryzen 7 1700X that’s over 180 times as long.
Finally, there is one case where reduce_by_index
still performs poorly, namely when k is so large that only a single histogram array is used by all threads, yet only a few bins are actually used. This will cause collisions and hence poor performance. So, don’t do that.