When we need to look for the position of an element in a sorted sequence, we can use a binary search to perform the lookup in O(log(n)) steps instead of the O(n) that would normally be required. The difference is often large enough that even if the sequence isn’t sorted in advance, it may be worth sorting it if we need to do many lookups. Since Futhark is a parallel language, this is the case we care about.
In Futhark it is pretty straightforward to implement a polymorphic function that given a less-than-or-equal comparison operator performs a binary search for an element in an (assumed) sorted array:
def binary_search [n] 't (lte: t -> t -> bool) (xs: [n]t) (x: t) : i64 = let (l, _) = loop (l, r) = (0, n-1) while l < r do let t = l + (r - l) / 2 in if x `lte` xs[t] then (l, t) else (t+1, r) in l
This function returns the index where the
x element would occur
if it occurs at all. Examples:
> binary_search (<=) [1,3,5,7,9,11] 0 0i64 > binary_search (<=) [1,3,5,7,9,11] 9 4i64 > binary_search (<=) [1,3,5,7,9,11] 2 1i64
Note that when we look for
2, we return the index
contains the value
3. Callers must look up the returned index
and check whether it is actually pointing to the element they were
looking for. That’s perhaps a bit unwieldy, and we could use
Futhark’s sum types to make this a bit nicer.
While the binary search shown above is asymptotically optimal, it is known to be inefficient in practice, because it jumps all over memory, yielding very poor cache locality. This tutorial (based on this paper) shows that if we lay out the array in BFS-order, then the levels of the search tree are stored consecutively in memory, yielding much better locality. The tutorial also goes into detail about various other low-level optimisations, but we will focus on the alternative data layout, in particular whether it is useful when generating GPU code.
While GPUs have caches, they are not usually as crucial as when doing CPU programming. What is more important is memory coalescing, which is roughly about ensuring that neighbouring threads access neighbouring memory addresses in the same clock cycle, as this allows full utilisation of the very wide memory bus of the GPU. We can’t expect coalescing when executing something as unstructured as a binary search, but maybe using an Eytzinger representation can help a bit.
First we need to implement a function for constructing an Eytzinger array (which models a search tree) from a sorted array. The tutorial shows an elegant, but sequential, C function for doing this. No good for Futhark. Fortunately, for input sizes that are one less than a power of two (so they fill out an entire binary tree), there is a closed formula that given an index i in an n-element Eytzinger array produces the index of the corresponding element in the original sorted array. It is based on computing the following derived quantities (don’t worry about the details if you find them fiddly - they are fiddly):
lvl: the number of edges from the root to this element. Equivalently, the number of full levels of the search tree that precede this element. This can be computed as the binary logarithm of i+1 (rounded up).
offset: the number of elements in the sorted array that separate each element of this level in the Eytzinger tree. This quantity halves for every level in the tree.
k: index of the last element of the level above the one containing i. This means that i-k is index of i within the current level of the tree.
We can implement this in Futhark as follows. First we define an efficient integer binary logarithm function by using the count leading zeroes primitive function:
def log2 x = 63 - i64.clz x
Then we can define
def eytzinger_index (n: i64) (i: i64) = let lvl = log2 (i+1) let offset = i64.i32 (1<<(log2 n-lvl)) let k = i64.i32 ((1<<lvl)-1) in offset + (i-k) * offset * 2 - 1
eytzinger function applies
every integer from 0 to n-1, and reads the corresponding
element from the sorted array
def eytzinger [n] 't (xs: [n]t) : [n]t = let f i = xs[eytzinger_index n i] in tabulate n f
Constructing the Eytzinger array is the only really fiddly part of this. The search function can be ported fairly directly from the C function, although we do have to implement our own find first bit function, since it’s not a Futhark primitive:
def ffs x = i64.ctz x + 1
Now we can define
eytzinger_search, where we assume
xs is an
def eytzinger_search [n] 't (lte: t -> t -> bool) (xs: [n]t) (x: t) : i64 = let k = loop k = 1 while k <= n do if x `lte` xs[k-1] then 2*k else 2*k+1 in (k >> i64.i32 (ffs (!k)))-1
Alright, let’s benchmark this. Unfortunately, while Futhark’s benchmarking tool supports randomly generated input of any desired size, it does not provide any way of generating sorted input, which we need here. We thus have to generate the input manually. First, we need a function for sorting. We should of course use a proper sorting library, but it’s awfully tempting to just use the radix sort example that we already have in this very same subdirectory…
module sort = import "radix-sort" entry sorted (xs: i32) = xs |> map u32.i32 |> sort.radix_sort |> map i32.u32
The only quirk is that our radix sort operates on unsigned 32-bit integers, so we need to convert back and forth. This is fine as long as we are careful not to make our input too large. Now we can compile the program:
$ futhark opencl binary-search.fut
Then we generate a 2²⁵-1-element array with futhark
pass it to the
sorted function, and dump the resulting array to
$ futhark dataset -b --i32-bounds=0:33554430 -g 'i32' | ./binary-search -b -e sorted >input
I am rather curious about how long it takes to construct the
Eytzinger array, so we define an entry point that calls
eytzinger, and also provide an input block that will be picked up
-- == -- entry: to_eytzinger -- compiled input @ input
entry to_eytzinger : i32 -> i32 = eytzinger
But more importantly, we use the
to_eytzinger entry point to
generate a file containing the Eytzinger array:
$ ./binary-search -b -e to_eytzinger <input >input_eytzinger
Then we define two entry points that search using either the naive strategy, or the Eytzinger array. In both cases we also use the input array as the keys to search for. I don’t think this skews the results, and it ensures we search for every element in the array.
-- == -- entry: bench_binary_search -- compiled input @ input
entry bench_binary_search [n] (xs: [n]i32) = map (binary_search (<=) xs) xs
-- == -- entry: bench_eytzinger_search -- compiled input @ input_eytzinger
entry bench_eytzinger_search [n] (xs: [n]i32) = map (eytzinger_search (<=) xs) xs
Finally, as a point of comparison, we also want to measure how long it would take to just copy the input:
-- == -- entry: bench_copy -- compiled input @ input
entry bench_copy [n] (xs: [n]i32) = copy xs
Now we are ready for benchmarking, with 100 runs per entry point:
$ futhark bench binary-search.fut --backend=opencl -r 100
These results are from an NVIDIA RTX 2080 Ti GPU:
Results for binary-search.fut:to_eytzinger: input: 1698μs (RSD: 0.009; min: -2%; max: +3%) Results for binary-search.fut:bench_binary_search: input: 2670μs (RSD: 0.003; min: -1%; max: +1%) Results for binary-search.fut:bench_eytzinger_search: input_eytzinger: 2107μs (RSD: 0.009; min: -2%; max: +4%) Results for binary-search.fut:bench_copy: input: 534μs (RSD: 0.006; min: -1%; max: +4%)
The Eytzinger approach is only 20% faster than the naive approach, so there is not much advantage. However, it only takes three times as long to compute the Eytzinger array as it takes to copy the input array, so if we want to do an enormous amount of lookups, it may be profitable. It’s not as profitable as on a Xeon E5-2650 CPU, though:
$ futhark bench binary-search.fut --backend=c -r 10 Compiling binary-search.fut... Reporting average runtime of 10 runs for each dataset. Results for binary-search.fut:to_eytzinger: input: 417177μs (RSD: 0.003; min: -0%; max: +0%) Results for binary-search.fut:bench_binary_search: input: 4293823μs (RSD: 0.021; min: -2%; max: +4%) Results for binary-search.fut:bench_eytzinger_search: input_eytzinger: 2749538μs (RSD: 0.025; min: -2%; max: +6%) Results for binary-search.fut:bench_copy: input: 56820μs (RSD: 0.008; min: -1%; max: +2%)