Fork me on GitHub

Futhark 0.26.3 released - now with property-based testing

Posted on May 27, 2026

I just tagged a new release of Futhark - version 0.26.3. The attentive reader will of course notice that the middle version number has increased compared to the last release announcement, which we use to indicate that we have broken backwards compatibility somehow. But also, given that this release is 0.26.3, it is clearly not the release that did so. I actually did write a blog post on why we broke compatibility, so that’s not the topic of this post. Rather, I want to talk about the flagship feature of the new release, property-based testing, courtesy of the two students Matin Nafar and Simon August Mørk (as is tradition).

Property-based testing is a strategy to testing software where, instead of providing predefined input-output pairs, you provide a property that should hold (such that a sorted function produces a permutation of the input), and the tool then randomly generates inputs, looking for an example that will make the property fail. It’s one of those techniques that are not guaranteed to work, but tend to work pretty well in practice. Haskell’s QuickCheck library popularised the idea, but today it exists for many languages - it is my impression that the Erlang ecosystem is perhaps the most enthusiastic user.

I actually did already write a blog post on a certain facet of the design. For those following along at home, we eventually resolved this design question through a combination of option 4 and 5. To cut to the chase, this is how we now define a minimal test program with a property stating that all integers are even:

-- ==
-- property: all_i32s_are_even

#[prop]
entry all_i32s_are_even (x: i32) = x % 2 == 0

If we pass this program to futhark test, it will chew for a bit and dutifully reply:

PBT FAIL: all_i32s_are_even size=50 seed=18384861974223310616 after 1 tests
Counterexample: 655892471i32

(Specific message format subject to change.)

I really like the ergonomics of this. It’s not as if QuickCheck for Haskell is particularly hard to use, but it requires more ceremony than this. Because Futhark values are ultimately transparent to the tooling (value orientation!), futhark test can do automatic generation and shrinking of values. Is it very good? No, probably not yet: I have reason to suspect there are smaller non-even numbers than 655892471, although the question remains open for now. The system is extensible however, so if we wish, we can define our own shrinking function:

-- ==
-- property: all_i32s_are_even

entry shrink_i32 (x: i32) (seed: u64) = x >> i32.min 32 (i32.u64 seed)

#[prop(shrink(shrink_i32))]
entry all_i32s_are_even (x: i32) = x % 2 == 0

A shrinker for a type t is a function of type t -> u64 -> t, where the u64 is a random number that the shrinker can use to make decisions - here it is how much I shift the integer by. With this, we now get a startlingly smaller result:

PBT FAIL: all_i32s_are_even size=50 seed=11373889006282598657 after 1 tests
Counterexample: 1i32

We will probably have to spend more effort on improving automatic shrinking and generation of values (in particular, the generator for arrays will currently always produce square arrays), but I believe it is possible to make it quite practical for many cases.

If we wish, we can define our own value generators. This is particularly useful when we have internal invariants, particularly on array sizes, that are not visible to the automatic value generator. A generator for functions of type t is a function of type i64 -> u64 -> t, taking a size and random seed as arguments. Example:

-- ==
-- property: all_i32s_are_even

entry gen_i32 (size: i64) (seed: u64) = i32.i64 size & i32.u64 seed & !1

#[prop(gen(gen_i32))]
entry all_i32s_are_even (x: i32) = x % 2 == 0

And now, through the solid engineering employed in the custom generator, our property now holds.

Writing principled generators manually is too tedious to be practical, so we intend to also develop a small combinator library to make it easier. However, the goal is certainly that most reasonable testing can be done using only the automatic tooling.

I have already begun the process of converting the tests in the various Futhark libraries to be property-based. One reason I am particularly drawn to property-based testing is that it is fun. Writing up input/output examples is tedious and boring, while figuring out a nice correctness property for a sorting function is a pleasant intellectual challenge. Ultimately, I came up with the following helper functions:

-- | How many elements satisfy the predicate?
def count p = map p >-> map i64.bool >-> i64.sum

-- | Correctness property: a sorting function is valid if it produces a sorted
-- sequence, and for every element in the original sequence, the same number of
-- equal elements is present in the original and sorted sequence.
--
-- This is quadratic-time to check, but hopefully this will not be a big problem
-- in practice.
def correct_sort 'a [n] (sort: [n]a -> [n]a) (lte: a -> a -> bool) (xs: [n]a) : bool =
  let xs' = sort xs
  let eql x y = lte x y && lte y x
  in and (tabulate n (\i -> i == 0 || (xs'[i - 1] `lte` xs'[i])))
     && all (\x -> count (eql x) xs' == count (eql x) xs) xs'

The function is then used like this:

#[prop]
entry radix_sort_u16 = correct_sort (radix_sort u16.num_bits u16.get_bit) (<=)

I wonder how many libraries treat property-based testing as completely first class, or perhaps even preferred over input/output tests. I doubt Futhark is the first one. Hopefully we are not in for a rude surprise as to why it’s not more popular.