Futhark still lacks many of the syntactical niceties that one might desire in a programming language. Therefore, the examples here are more verbose than ideal. Still, they should provide a taste of what (simple) Futhark programs look like. If you have ideas for syntax improvements, you can always contribute! For more examples, you can check the examples directory in the Futhark repository, or at our implemented benchmarks.
As Futhark is a functional language, we will start with the obligatory factorial program:
let fact (n: i32): i32 = reduce (*) 1 (map (1+) (iota n)) let main (n: i32): i32 = fact n
The function call
fact n creates an array of the integers
0..n-1, adds one to each element of the array, then computes the product of all elements in the array. The Futhark compiler employs loop fusion to remove the need for any of these temporary arrays to be actually created. Technically
fact n does not compute
n!, but rather
n! mod 2**32, as
i32s are 32 bit in size and will rapidly overflow for large
If we put the above program in a file
fact.fut, we can compile it using the OpenCL backend as such:
$ futhark-opencl fact.fut
If all goes well, this produces an executable program
fact in the current directory. Similarly, we can compile to sequential C code with
futhark-c. As mentioned in the introduction, Futhark is not intended to be used for writing standalone programs, but it is supported in order to enable testing and benchmarking. A standalone program will expect to be given its arguments on standard input:
$ echo 2000000000 | ./fact
This may produce a bunch of diagnostics on standard error, but will also write the result on standard output (
0i32 - a 32-bit zero). We can use the built-in instrumentation to determine how long the computation took, not counting GPU context initialisation and the like:
$ echo 2000000000 | ./fact -t runtime.txt
runtime.txt will contain the wall time in microseconds. On a GTX 780 Ti GPU,
fact(2000000000) (two billion!) runs in 7.0ms . A sequential C program using a
for-loop to compute the same thing takes 1335.3ms on an Intel Xeon E5-2650 CPU. Of course, this is not a realistic performance comparison, as neither program accesses memory, but it shows how easy it is to obtain parallel execution in Futhark. If we ask the Futhark compiler to generate sequential C code (with
futhark-c), the resulting program runs in exactly the same time as the hand-written C program.
A More Complex Example
A more interesting example is the maximum segment sum problem (MSS), where we wish to determine the maximum sum of any contiguous subsequence of an array of integers. We can implement this in Futhark using a combination of
let max (x: i32) (y: i32): i32 = if x > y then x else y let redOp ((mssx, misx, mcsx, tsx): (i32,i32,i32,i32)) ((mssy, misy, mcsy, tsy): (i32,i32,i32,i32)): (i32,i32,i32,i32) = ( max mssx (max mssy (mcsx + misy)) , max misx (tsx+misy) , max mcsy (mcsx+tsy) , tsx + tsy) let mapOp (x: i32): (i32,i32,i32,i32) = ( max x 0 , max x 0 , max x 0 , x) let main (xs: i32): i32 = let (x, _, _, _) = reduce redOp (0,0,0,0) (map mapOp xs) in x
One interesting aspect about this program is that it involves a reduction with an operator that is associative, but not commutative. Associativity is a requirement for the parallel execution of reductions, but commutativity is not required. Yet, for reasons of implementation difficulty, many parallel languages and libraries will malfunction if the reduction operator is not commutative. Futhark supports non-commutative operators, as we have found that many interesting problems (such as MSS above) cannot be solved efficiently with just commutative reductions.
On a GTX 780 Ti GPU, Futhark can compute the MSS of ten million integers in 1.2ms. Much of the runtime is spent transposing the input array in order to ensure optimal memory access patterns during the actual reduction, which is necessary when compiling non-commutative reductions. The performance benefit compared to just using a scan (which never assumes commutative operators) is illustrated elsewhere.
Gaussian Blur Stencil
One common pattern of array computation is the so-called stencil, where we change the value of an element in the array based on its neighbours. For example, we might implement image blurring by assigning each pixel the average value of all of its neighbors. Futhark does not have a special-purpose stencil language construct. Instead, stencil computations are expressed as
maps on the index space, using explicit array indexing to access the stencil source array and returning the new value for the index. While this is rather verbose, at least until Futhark grows more syntactical conveniences, it works and performs well. Let’s look at how to implement a simple image blurring program.
We will represent an image as a three-dimensional array
[rows][cols]u8. The innermost size-3 dimension encodes the three colour channels for red, green, and blue, respectively. When blurring, it is useful to operate on each colour channel separately. Furthermore, instead of the colour being a number from 0 to 255, it is more convenient to store it as a floating-point number between 0 and 1.0. Therefore, we define a function that transforms an array of type
[rows][cols]u8 into three arrays of type
[rows][cols]f32 each. The result is that we have one array for each of the three colour channels:
let splitIntoChannels(image: [rows][cols]u8): ([rows][cols]f32, [rows][cols]f32, [rows][cols]f32) = unzip(map (\row -> map (\pixel -> (f32(pixel) / 255f32, f32(pixel) / 255f32, f32(pixel) / 255f32)) row) image)
splitIntoChannels maps across each inner
u8 element (
pixel), turns this into a triple instead of a three-element array, then uses
unzip to turn the resulting array-of-triples into a triple-of-arrays, which is then returned. For readability, we could have chosen to explicitly indicate the return and parameter types of the anonymous function, but in the interest of brevity we have left them for the compiler to infer. It is only required to explicitly indicate the types of all top-level functions.
We will also need to re-combine the colour channel arrays into a single array. That function looks like this:
let combineChannels(rs: [rows][cols]f32, gs: [rows][cols]f32, bs: [rows][cols]f32): [rows][cols]u8 = map (\rs_row gs_row bs_row -> map (\r g b -> [u8(r * 255f32), u8(g * 255f32), u8(b * 255f32)]) rs_row gs_row bs_row) rs gs bs
Another thing we will need is the actual stencil function. That is, the function we wish to apply to every pixel in the image. For blurring, we will take the average value of the pixel itself plus each of its eight neighbors (nine values in total):
let newValue(image: [rows][cols]f32, row: i32, col: i32): f32 = unsafe let sum = image[row-1,col-1] + image[row-1,col] + image[row-1,col+1] + image[row, col-1] + image[row, col] + image[row, col+1] + image[row+1,col-1] + image[row+1,col] + image[row+1,col+1] in sum / 9f32
The function call
newValue(image, row, col) computes the new value for the pixel at position
(row, col) in
The alert reader will have noticed that
newValue cannot be applied to pixels on the edge of the image - doing so would result in out-of-bounds accesses to the
image array. We will take care to only call the
newValue function with safe indices, but the Futhark compiler is sadly not yet smart to realise this - thus we are forced to use the
unsafe keyword to prevent the insertion of bounds checks that would otherwise hinder parallelisation. If we did not use
unsafe, the Futhark compiler would fail with an error message pointing at the problematic array access.
Now we can write the actual stencil function, which applies
newValue to every inner element of a colour channel array. The edges are left unchanged:
let blurChannel(channel: [rows][cols]f32): [rows][cols]f32 = map (\row -> map(\col -> if row > 0 && row < rows-1 && col > 0 && col < cols-1 then newValue(channel, row, col) else channel[row,col]) (iota cols)) (iota rows)
You may have heard that branches are expensive on a GPU. While this is a good basic rule of thumb, what is actually expensive is branch divergence - that is, when neighboring threads take different paths through a branch. In our stencil, only the edge elements will take the false branch, and these are few in number compared to the interior.
Stencil computations usually have an outer (sequential) loop for applying the stencil several times. Our program is no different - we will apply the blurring transformation a user-defined number of times. The more iterations we run, the more blurred the image will become:
let main(iterations: i32, image: [rows][cols]u8): [rows][cols]u8 = let (rs, gs, bs) = splitIntoChannels(image) loop ((rs, gs, bs)) = for i < iterations do let rs = blurChannel(rs) let gs = blurChannel(gs) let bs = blurChannel(bs) in (rs, gs, bs) in combineChannels(rs, gs, bs)
main function is quite simple. We split the input image into three different channels, use a sequential loop to blur each colour channel the requested number of times, then recombine the resulting channel arrays into a single final image.
loop construct merits an explanation: in the above function, we declare three loop variant variables,
bs. These take their initial values from the incidentally identically named variables in scope (but this is not in general requirement). The loop body then returns three values that become the values of the loop variant variables in the next iteration of the loop. In essence, the
loop construct is just syntactical suger for a particularly simple (but common) pattern of tail-recursive function. However, the Futhark compiler is able to perform transformations involving
loops that it cannot for recursive functions (although it does not perform any such for this simple program).
The three separate calls to
blurChannel may seem wasteful, but the Futhark compiler is smart enough to fuse them together into a single GPU kernel that traverses the three colour channel arrays simultaneously. This is an instance of horisontal fusion.
Our Futhark program is now done. The full commented source code is located here: blur.fut. We can make it a little more useful by writing a small Python wrapper program for reading and writing PNGs: blur-png.py. We must compile
blur.fut using the PyOpenCL backend:
$ futhark-pyopencl --library blur.fut
This produces a Python module
blur.py which is then imported by
blur-png.py. We can try it out on any PNG image, say, this illustration of the spirit of Futhark:
$ python blur-png.py gottagofast.png --output-file gottagofast-blurred.png
Which produces this slightly smushed image. We can also ask for a hundred iterations:
$ python blur-png.py gottagofast.png --output-file gottagofast-blurred.png --iterations 100
Which produces this blurry mess. Notice the edges - perhaps simply keeping them unchanged is not the best way to implement image blurring. Still, this program is a decent description of how to implement stencils in Futhark. For performance measurements on a slightly more complicated stencil, see HotSpot on the performance page.