-- # Binary search
--
-- When we need to look for the position of an element in a sorted
-- sequence, we can use a [binary
-- search](https://en.wikipedia.org/wiki/Binary_search_algorithm) to
-- perform the lookup in *O(log(n))* steps instead of the *O(n)* that
-- would normally be required. The difference is often large enough
-- that even if the sequence isn't sorted in advance, it may be worth
-- sorting it if we need to do many lookups. Since Futhark is a
-- parallel language, this is the case we care about.
--
-- In Futhark it is pretty straightforward to implement a polymorphic
-- function that given a less-than-or-equal comparison operator
-- performs a binary search for an element in an (assumed) sorted
-- array:

def binary_search [n] 't (lte: t -> t -> bool) (xs: [n]t) (x: t) : i64 =
  let (l, _) =
    loop (l, r) = (0, n-1) while l < r do
    let t = l + (r - l) / 2
    in if x `lte` xs[t]
       then (l, t)
       else (t+1, r)
  in l

-- This function returns the index where the `x` element would occur
-- *if it occurs at all*. Examples:
--
-- ```
-- > binary_search (<=) [1,3,5,7,9,11] 0
-- 0i64
-- > binary_search (<=) [1,3,5,7,9,11] 9
-- 4i64
-- > binary_search (<=) [1,3,5,7,9,11] 2
-- 1i64
-- ```
--
-- Note that when we look for `2`, we return the index `1`, which
-- contains the value `3`. Callers must look up the returned index
-- and check whether it is actually pointing to the element they were
-- looking for. That's perhaps a bit unwieldy, and we could use
-- [Futhark's sum types](sum-types.html) to make this a bit nicer.
--
-- While the binary search shown above is *asymptotically* optimal, it
-- is known to be inefficient in practice, because it jumps all over
-- memory, yielding very poor [cache
-- locality](https://en.wikipedia.org/wiki/Locality_of_reference).
-- [This tutorial](https://algorithmica.org/en/eytzinger) (based on
-- [this paper](https://arxiv.org/pdf/1509.05053.pdf)) shows that if
-- we lay out the array in
-- [BFS](https://en.wikipedia.org/wiki/Breadth-first_search)-order,
-- then the levels of the search tree are stored consecutively in
-- memory, yielding much better locality. The tutorial also goes into
-- detail about various other low-level optimisations, but we will
-- focus on the alternative data layout, in particular whether it is
-- useful when generating GPU code.
--
-- While GPUs have caches, they are not usually as crucial as when
-- doing CPU programming. What is more important is [memory
-- coalescing](https://cvw.cac.cornell.edu/GPU/coalesced), which is
-- roughly about ensuring that neighbouring threads access
-- neighbouring memory addresses in the same clock cycle, as this
-- allows full utilisation of the very wide memory bus of the GPU. We
-- can't expect coalescing when executing something as unstructured as
-- a binary search, but maybe using an Eytzinger representation can
-- help a bit.
--
-- First we need to implement a function for constructing an Eytzinger
-- array (which models a search tree) from a sorted array. The
-- tutorial shows an elegant, but sequential, C function for doing
-- this. No good for Futhark. Fortunately, for input sizes that are
-- one less than a power of two (so they fill out an entire binary
-- tree), there is a closed formula that given an index *i* in an
-- *n*-element Eytzinger array produces the index of the corresponding
-- element in the original sorted array.
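--
-- To make the layout concrete, here is (as a small illustration, not
-- output from this program) a seven-element sorted array next to its
-- Eytzinger counterpart: the root of the search tree (the median)
-- comes first, then the two elements of the level below it, and so
-- on, level by level:
--
-- ```
-- sorted:    [1, 2, 3, 4, 5, 6, 7]
-- eytzinger: [4, 2, 6, 1, 3, 5, 7]
-- ```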
-- The formula is based on computing the following derived quantities
-- (don't worry about the details if you find them fiddly - they *are*
-- fiddly):
--
-- * *lvl*: the number of edges from the root to this element.
--   Equivalently, the number of full levels of the search tree that
--   precede this element. This can be computed as the binary
--   logarithm of *i+1* (rounded down).
--
-- * *offset*: the number of elements in the sorted array that
--   separate each element of this level in the Eytzinger tree. This
--   quantity halves for every level in the tree.
--
-- * *k*: index of the last element of the level above the one
--   containing *i*. This means that *i-k* is the (1-based) index of
--   *i* within the current level of the tree.
--
-- We can implement this in Futhark as follows. First we define an
-- efficient integer binary logarithm function by using the *count
-- leading zeroes* primitive function:

def log2 x = 63 - i64.clz x

-- Then we can define `eytzinger_index` itself:

def eytzinger_index (n: i64) (i: i64) =
  let lvl = log2 (i+1)
  let offset = i64.i32 (1<<(log2 n-lvl))
  let k = i64.i32 ((1<<lvl)-2)
  in offset-1 + (i-k-1)*offset*2

-- Constructing the Eytzinger array is then just a gather, which
-- parallelises trivially:

def eytzinger [n] 't (xs: [n]t) : [n]t =
  tabulate n (\i -> xs[eytzinger_index n i])

-- Searching the Eytzinger array also follows the tutorial. Using
-- 1-based node numbering, where the children of node *k* are *2k* and
-- *2k+1*, we descend from the root, going left when the element at
-- the current node is at least as large as the one we are searching
-- for, and right otherwise. The answer is the last node at which we
-- went left, and once we fall off the bottom of the tree it can be
-- recovered from the bit pattern of *k* with a *find first set*
-- operation, which we define in terms of *count trailing zeroes*.
-- Note that the returned index is a position in the Eytzinger array,
-- not in the sorted one:

def ffs x = i64.ctz x + 1

def eytzinger_search [n] 't (lte: t -> t -> bool) (xs: [n]t) (x: t) : i64 =
  let k = loop k = 1 while k <= n do
            if x `lte` xs[k-1]
            then 2*k
            else 2*k+1
  in (k >> i64.i32 (ffs (!k))) - 1

-- Alright, let's benchmark this. Unfortunately, while Futhark's
-- [benchmarking
-- tool](https://futhark.readthedocs.io/en/stable/man/futhark-bench.html)
-- supports randomly generated input of any desired size, it does
-- *not* provide any way of generating *sorted* input, which we need
-- here. We thus have to generate the input manually. First, we need
-- a function for sorting. We should of course use a [proper sorting
-- library](https://github.com/diku-dk/sorts), but it's awfully
-- tempting to just use the [radix sort example](radix-sort.html) that
-- we already have in this very same subdirectory...

module sort = import "radix-sort"

entry sorted (xs: []i32) =
  xs |> map u32.i32 |> sort.radix_sort |> map i32.u32

-- The only quirk is that our radix sort operates on unsigned 32-bit
-- integers, so we need to convert back and forth. This is fine as
-- long as we are careful not to make our input too large. Now we can
-- compile the program:
--
-- ```
-- $ futhark opencl binary-search.fut
-- ```
--
-- Then we generate a *2²⁵-1*-element array with [futhark
-- dataset](https://futhark.readthedocs.io/en/stable/man/futhark-dataset.html),
-- pass it to the `sorted` function, and dump the resulting array to
-- the file `input`:
--
-- ```
-- $ futhark dataset -b --i32-bounds=0:33554430 -g '[33554431]i32' | ./binary-search -b -e sorted >input
-- ```
--
-- I am rather curious about how long it takes to construct the
-- Eytzinger array, so we define an entry point that calls
-- `eytzinger`, and also provide an input block that will be picked up
-- by `futhark bench`:
-- ==
-- entry: to_eytzinger
-- compiled input @ input

entry to_eytzinger : []i32 -> []i32 = eytzinger

-- But more importantly, we use the `to_eytzinger` entry point to
-- generate a file containing the Eytzinger array:
--
-- ```
-- $ ./binary-search -b -e to_eytzinger <input >input_eytzinger
-- ```
--
-- Then we define two entry points that search using either the naive
-- strategy or the Eytzinger array. In both cases we also use the
-- input array as the keys to search for. I don't think this skews
-- the results, and it ensures we search for every element in the
-- array.
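--
-- Before we get to those entry points, we can quickly convince
-- ourselves that the two searches agree. The entry point below is
-- not part of the benchmark, and the name `agree` is invented just
-- for this sketch. It checks that looking up every element of a
-- sorted array (with one less than a power of two elements, as
-- assumed throughout) through either search finds that element again,
-- keeping in mind that `eytzinger_search` returns an index into the
-- Eytzinger array rather than the sorted one:

entry agree [n] (xs: [n]i32) : bool =
  -- Build the Eytzinger layout once, then look up every element
  -- through both searches and check that we find it.
  let ys = eytzinger xs
  in all (\x -> xs[binary_search (<=) xs x] == x
                && ys[eytzinger_search (<=) ys x] == x)
         xs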

-- ==
-- entry: bench_binary_search
-- compiled input @ input

entry bench_binary_search [n] (xs: [n]i32) =
  map (binary_search (<=) xs) xs

-- ==
-- entry: bench_eytzinger_search
-- compiled input @ input_eytzinger

entry bench_eytzinger_search [n] (xs: [n]i32) =
  map (eytzinger_search (<=) xs) xs

-- Finally, as a point of comparison, we also want to measure how long
-- it would take to just copy the input:
-- ==
-- entry: bench_copy
-- compiled input @ input

entry bench_copy [n] (xs: [n]i32) =
  copy xs

-- Now we are ready for benchmarking, with 100 runs per entry point:
--
-- ```
-- $ futhark bench binary-search.fut --backend=opencl -r 100
-- ```
--
-- These results are from an NVIDIA RTX 2080 Ti GPU:
--
-- ```
-- Results for binary-search.fut:to_eytzinger:
-- input: 1698μs (RSD: 0.009; min: -2%; max: +3%)
--
-- Results for binary-search.fut:bench_binary_search:
-- input: 2670μs (RSD: 0.003; min: -1%; max: +1%)
--
-- Results for binary-search.fut:bench_eytzinger_search:
-- input_eytzinger: 2107μs (RSD: 0.009; min: -2%; max: +4%)
--
-- Results for binary-search.fut:bench_copy:
-- input: 534μs (RSD: 0.006; min: -1%; max: +4%)
-- ```
--
-- The Eytzinger approach is only 20% faster than the naive approach,
-- so there is not much advantage. However, it only takes three times
-- as long to compute the Eytzinger array as it takes to copy the
-- input array, so if we want to do an enormous amount of lookups, it
-- may be profitable. It's not as profitable as on a Xeon E5-2650
-- CPU, though:
--
-- ```
-- $ futhark bench binary-search.fut --backend=c -r 10
-- Compiling binary-search.fut...
-- Reporting average runtime of 10 runs for each dataset.
-- Results for binary-search.fut:to_eytzinger:
-- input: 417177μs (RSD: 0.003; min: -0%; max: +0%)
--
-- Results for binary-search.fut:bench_binary_search:
-- input: 4293823μs (RSD: 0.021; min: -2%; max: +4%)
--
-- Results for binary-search.fut:bench_eytzinger_search:
-- input_eytzinger: 2749538μs (RSD: 0.025; min: -2%; max: +6%)
--
-- Results for binary-search.fut:bench_copy:
-- input: 56820μs (RSD: 0.008; min: -1%; max: +2%)
-- ```
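--
-- A rough back-of-the-envelope calculation using the averages above
-- makes the comparison concrete: on the GPU, building the Eytzinger
-- array costs about 1698μs and each full batch of lookups is about
-- 2670μs - 2107μs ≈ 563μs faster, so the preprocessing pays for
-- itself after roughly three batches. On the CPU, the construction
-- costs about 417ms while each batch of lookups is about
-- 4294ms - 2750ms ≈ 1544ms faster, so it is amortised well within the
-- first batch.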