-- # Binary search
--
-- When we need to look for the position of an element in a sorted
-- sequence, we can use a [binary
-- search](https://en.wikipedia.org/wiki/Binary_search_algorithm) to
-- perform the lookup in *O(log(n))* steps instead of the *O(n)* that
-- would normally be required. The difference is often large enough
-- that even if the sequence isn't sorted in advance, it may be worth
-- sorting it if we need to do many lookups. Since Futhark is a
-- parallel language, this is the case we care about.
--
-- In Futhark it is pretty straightforward to implement a polymorphic
-- function that given a less-than-or-equal comparison operator
-- performs a binary search for an element in an (assumed) sorted
-- array:
def binary_search [n] 't (lte: t -> t -> bool) (xs: [n]t) (x: t) : i64 =
let (l, _) =
loop (l, r) = (0, n-1) while l < r do
let t = l + (r - l) / 2
in if x `lte` xs[t]
then (l, t)
else (t+1, r)
in l
-- This function returns the index where the `x` element would occur
-- *if it occurs at all*. Examples:
--
-- ```
-- > binary_search (<=) [1,3,5,7,9,11] 0
-- 0i64
-- > binary_search (<=) [1,3,5,7,9,11] 9
-- 4i64
-- > binary_search (<=) [1,3,5,7,9,11] 2
-- 1i64
-- ```
--
-- Note that when we look for `2`, we return the index `1`, which
-- contains the value `3`. Callers must look up the returned index
-- and check whether it is actually pointing to the element they were
-- looking for. That's perhaps a bit unwieldy, and we could use
-- [Futhark's sum types](sum-types.html) to make this a bit nicer.
--
-- While the binary search shown above is *asymptotically* optimal, it
-- is known to be inefficient in practice, because it jumps all over
-- memory, yielding very poor [cache
-- locality](https://en.wikipedia.org/wiki/Locality_of_reference).
-- [This tutorial](https://algorithmica.org/en/eytzinger) (based on
-- [this paper](https://arxiv.org/pdf/1509.05053.pdf)) shows that if
-- we lay out the array in
-- [BFS](https://en.wikipedia.org/wiki/Breadth-first_search)-order,
-- then the levels of the search tree are stored consecutively in
-- memory, yielding much better locality. The tutorial also goes into
-- detail about various other low-level optimisations, but we will
-- focus on the alternative data layout, in particular whether it is
-- useful when generating GPU code.
--
-- While GPUs have caches, they are not usually as crucial as when
-- doing CPU programming. What is more important is [memory
-- coalescing](https://cvw.cac.cornell.edu/GPU/coalesced), which is
-- roughly about ensuring that neighbouring threads access
-- neighbouring memory addresses in the same clock cycle, as this
-- allows full utilisation of the very wide memory bus of the GPU. We
-- can't expect coalescing when executing something as unstructured as
-- a binary search, but maybe using an Eytzinger representation can
-- help a bit.
--
-- First we need to implement a function for constructing an Eytzinger
-- array (which models a search tree) from a sorted array. The
-- tutorial shows an elegant, but sequential, C function for doing
-- this. No good for Futhark. Fortunately, for input sizes that are
-- one less than a power of two (so they fill out an entire binary
-- tree), there is a closed formula that given an index *i* in an
-- *n*-element Eytzinger array produces the index of the corresponding
-- element in the original sorted array. It is based on computing the
-- following derived quantities (don't worry about the details if you
-- find them fiddly - they *are* fiddly):
--
-- * *lvl*: the number of edges from the root to this element.
-- Equivalently, the number of full levels of the search tree that
-- precede this element. This can be computed as the binary
-- logarithm of *i+1* (rounded up).
--
-- * *offset*: the number of elements in the sorted array that
-- separate each element of this level in the Eytzinger tree. This
-- quantity halves for every level in the tree.
--
-- * *k*: index of the last element of the level above the one
-- containing *i*. This means that *i-k* is index of *i* within the
-- current level of the tree.
--
-- We can implement this in Futhark as follows. First we define an
-- efficient integer binary logarithm function by using the *count
-- leading zeroes* primitive function:
def log2 x = 63 - i64.clz x
-- Then we can define `eytzinger_index` itself:
def eytzinger_index (n: i64) (i: i64) =
let lvl = log2 (i+1)
let offset = i64.i32 (1<<(log2 n-lvl))
let k = i64.i32 ((1< t -> bool) (xs: [n]t) (x: t) : i64 =
let k =
loop k = 1 while k <= n do
if x `lte` xs[k-1]
then 2*k
else 2*k+1
in (k >> i64.i32 (ffs (!k)))-1
-- Alright, let's benchmark this. Unfortunately, while Futhark's
-- [benchmarking
-- tool](https://futhark.readthedocs.io/en/stable/man/futhark-bench.html)
-- supports randomly generated input of any desired size, it does
-- *not* provide any way of generating *sorted* input, which we need
-- here. We thus have to generate the input manually. First, we need
-- a function for sorting. We should of course use a [proper sorting
-- library](https://github.com/diku-dk/sorts), but it's awfully
-- tempting to just use the [radix sort example](radix-sort.html) that
-- we already have in this very same subdirectory...
module sort = import "radix-sort"
entry sorted (xs: []i32) =
xs |> map u32.i32 |> sort.radix_sort |> map i32.u32
-- The only quirk is that our radix sort operates on unsigned 32-bit
-- integers, so we need to convert back and forth. This is fine as
-- long as we are careful not to make our input too large. Now we can
-- compile the program:
--
-- ```
-- $ futhark opencl binary-search.fut
-- ```
--
--
-- Then we generate a *2²⁵-1*-element array with [futhark
-- dataset](https://futhark.readthedocs.io/en/stable/man/futhark-dataset.html),
-- pass it to the `sorted` function, and dump the resulting array to
-- the file `input`.
--
-- ```
-- $ futhark dataset -b --i32-bounds=0:33554430 -g '[33554431]i32' | ./binary-search -b -e sorted >input
-- ```
-- I am rather curious about how long it takes to construct the
-- Eytzinger array, so we define an entry point that calls
-- `eytzinger`, and also provide an input block that will be picked up
-- by `futhark bench`:
-- ==
-- entry: to_eytzinger
-- compiled input @ input
entry to_eytzinger : []i32 -> []i32 = eytzinger
-- But more importantly, we use the `to_eytzinger` entry point to
-- generate a file containing the Eytzinger array:
--
-- ```
-- $ ./binary-search -b -e to_eytzinger input_eytzinger
-- ```
-- Then we define two entry points that search using either the naive
-- strategy, or the Eytzinger array. In both cases we also use the
-- input array as the keys to search for. I don't think this skews
-- the results, and it ensures we search for every element in the
-- array.
-- ==
-- entry: bench_binary_search
-- compiled input @ input
entry bench_binary_search [n] (xs: [n]i32) =
map (binary_search (<=) xs) xs
-- ==
-- entry: bench_eytzinger_search
-- compiled input @ input_eytzinger
entry bench_eytzinger_search [n] (xs: [n]i32) =
map (eytzinger_search (<=) xs) xs
-- Finally, as a point of comparison, we also want to measure how long
-- it would take to just copy the input:
-- ==
-- entry: bench_copy
-- compiled input @ input
entry bench_copy [n] (xs: [n]i32) =
copy xs
-- Now we are ready for benchmarking, with 100 runs per entry point:
--
-- ```
-- $ futhark bench binary-search.fut --backend=opencl -r 100
-- ```
--
-- These results are from an NVIDIA RTX 2080 Ti GPU:
--
-- ```
-- Results for binary-search.fut:to_eytzinger:
-- input: 1698μs (RSD: 0.009; min: -2%; max: +3%)
--
-- Results for binary-search.fut:bench_binary_search:
-- input: 2670μs (RSD: 0.003; min: -1%; max: +1%)
--
-- Results for binary-search.fut:bench_eytzinger_search:
-- input_eytzinger: 2107μs (RSD: 0.009; min: -2%; max: +4%)
--
-- Results for binary-search.fut:bench_copy:
-- input: 534μs (RSD: 0.006; min: -1%; max: +4%)
-- ```
--
-- The Eytzinger approach is only 20% faster than the naive approach,
-- so there is not much advantage. However, it only takes three times
-- as long to compute the Eytzinger array as it takes to copy the
-- input array, so if we want to do an enormous amount of lookups, it
-- may be profitable. It's not as profitable as on a Xeon E5-2650
-- CPU, though:
--
-- ```
-- $ futhark bench binary-search.fut --backend=c -r 10
-- Compiling binary-search.fut...
-- Reporting average runtime of 10 runs for each dataset.
-- Results for binary-search.fut:to_eytzinger:
-- input: 417177μs (RSD: 0.003; min: -0%; max: +0%)
--
-- Results for binary-search.fut:bench_binary_search:
-- input: 4293823μs (RSD: 0.021; min: -2%; max: +4%)
--
-- Results for binary-search.fut:bench_eytzinger_search:
-- input_eytzinger: 2749538μs (RSD: 0.025; min: -2%; max: +6%)
--
-- Results for binary-search.fut:bench_copy:
-- input: 56820μs (RSD: 0.008; min: -1%; max: +2%)
-- ```