
How we test(ed) the Futhark compiler

Posted on June 10, 2024 by Troels Henriksen

In this post I will go through the evolution of the tools we use for testing the Futhark compiler. Along the way, these also became futhark test, the user-facing tool for testing Futhark programs, and although rather convenient, it is still pretty crude compared to the testing tools you will find in other languages. This post is perhaps most interesting to other language implementation hobbyists.

Every program starts with a shell script

In the earliest days, when you ran a Futhark program it would read things from standard input and write things on standard output. (In the extremely early days, Futhark actually supported impure functions for doing this, although this was quickly changed to implicitly reading the arguments to the main function, driven by their types.) At some point we realised that we weren’t sure that the compiler worked, and so we decided to add a way of running our test programs.

The vision was simple: each program could be associated with an .in file that would be fed to the program on stdin, and an .out file that would be compared against whatever the program wrote on stdout. If no .in file was present, it was assumed that the program contained a type error and should be rejected. This early support for negative tests was remarkable considering how crude the system as a whole was - here is the shell script that implements it.

You will notice, for example, that the test programs are explicitly listed (sort of) in the test runner itself. Also, the script does not actually run the programs - it merely compiles them. I vaguely remember that this testing script is derived from one I wrote when I was a TA on DIKU’s compiler course (which is incidentally also the root origin of Futhark), but I am pretty sure that one actually ran the test programs.

It took about a week to actually reach the point where the test runner tested the programs - looking at the commit history, I spent most of that week making it use more compiler passes, and presumably only implemented running programs once the compiler stopped crashing. Note that the expected and actual program output is compared exactly, with cmp(1). We will return to this.

A few weeks later I added support for only type checking the test programs, which is very handy when hacking on the type checker. I suppose this represents the point where we had added enough tests (or made the compiler slow enough) that actually running the entire test suite was too annoying. This option is still supported in futhark test.

Moving to Haskell

This code didn’t last very long, however, as soon after I rewrote the test runner in Haskell. I actually thought the shell script lasted a lot longer than a few weeks, but the Git log doesn’t lie. This test runner was still not a part of Futhark proper, however, but a completely separate Haskell program you ran with runhaskell. The main advantage was that it would perform multiple tests concurrently, as compared to the totally sequential shell script. Looking at its architecture, it is actually very similar to the modern futhark test tool. For unclear reasons, it took almost a month before we actually switched to using the Haskell runner by default. Good bye, shell script.

The next major change was adding a terrible CI setup. Specifically, I piggybacked on an IRC bot I was running to manage a lunch club with fellow students. This IRC bot was already set up to handle IRC notifications from GitHub whenever someone updated its code. It was a small amount of work to also make this lunch management bot receive notifications from the Futhark (well, L0 in those days) repository and run the CI suite. CI failures were reported by sending me an email. It was super flaky, and I don’t remember why I didn’t do something more sensible. This was in December of 2013, so hardly the stone age when it comes to CI. It wasn’t until over a year later that we finally started using Travis-CI (and eventually in 2020, GitHub Actions).

But back to the test runner. Its main problem at this point was that it did an exact comparison of expected and actual output, which meant it was very sensitive to pretty-printing details (this all predates the binary data format). It was also very annoying when we started having multiple backends that did not guarantee bit-level consistency for things like floating point arithmetic. The first (weak) attempt at solving this was ignoring whitespace when comparing outputs, but it wasn’t until February 2015 that the test runner actually started considering the output to be Futhark values, rather than just arbitrary strings, in order to compare floating-point values within an allowable tolerance. It’s quite remarkable how long it took to get rid of some of the misdesigns of that initial shell script.

By this point, the test runner ran both the interpreter and the compiler. This was a very good design, as having two implementations of the language forced us to think about what programs were actually supposed to do, rather than what they happened to do by accident. However, eventually we started writing test programs that were too slow to run in the interpreter, so it became possible to turn the interpreter off. This is the start of the test runner growing a confusing array of different modes in which it can be run. A few weeks later, we added the obvious counterpart option that only runs the interpreter. These options are still supported in futhark test.

The first large change to the design of how test programs were specified came in April 2015, and removed the part where a program foo.fut was associated with foo.in and foo.out. The problem (apart from it being annoying to juggle multiple files) was that it only allowed a single input/output pair for each program. Instead, the test input and expected output would now be embedded in the program itself, in the form of a comment, as shown here:

// --
// input {
//   10
// }
// output {
//    [ 0 , 1 , 1 , 2 , 3 , 5 , 8 , 13 , 21 , 34  ]
// }

This is still essentially the design we use today, although // became -- and -- became ==. It did prove awkward for test programs with very large expected outputs, but was a marked improvement for the rest.
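For comparison, the same test block in today’s syntax (following the comment and separator changes just mentioned) looks like this:

-- ==
-- input {
--   10
-- }
-- output {
--   [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
-- }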

In the same month, the ad-hoc Haskell script tucked away in a sub-directory finally became futhark-test, and was built using the same build system as the main compiler. This was the start of providing testing tools that could be used by Futhark programmers to test their own programs (as opposed to compiler hackers testing the compiler), although it was still quite crude, and almost completely undocumented. (But anyway, you wouldn’t want to use 2015-era Futhark.)

The opening comment in futhark-test.hs was pretty clear about the audience:

-- | This program is a convenience utility for running the Futhark
-- test suite, and its test programs.

Beyond correctness testing

Until this point, the program then named futhark-test had been focused on functional correctness - when run, did the programs produce the right output? However, at this point we had spent a lot of time adding optimisation passes, and (unwittingly) also breaking them. Our testing tool could easily detect when an optimisation produced incorrect code (helped by the internal consistency checks in the compiler), but it had no way of detecting when we broke an optimisation pass to such an extent that it simply did not change the program.

Inspired by some tests I found in the LLVM test suite, I became interested in being able to express that the program resulting from optimisation should have some loosely specified shape. I didn’t want to hardcode the entire expected representation, because the fine details about variable naming and operation ordering are not (necessarily) intended to be stable, but I wanted to say things like “this program should consist of two loops at top level, the latter with a reduction inside”.

Instead of doing the work myself, I proposed the construction of such a tool as a Bachelor’s Project at the department (the start of a beautiful tradition), and the skilled Maya Saietz developed a tool where you provided a high-level AST “pattern”, and the tool checked whether that pattern matched the program produced by the compiler. Maya’s work was quite good, but a bit overkill - and it was difficult to express that a given construct was not allowed to appear in some position (e.g., demand the absence of bounds checks). Instead, I added a “quick hack” (still around in essentially unchanged form, naturally) where you can simply specify how many of a specific kind of AST node are allowed in the program. For example, this might specify the absence of bounds checks:

structure { Assert 0 }

And this might specify that the program contains only a single SOAC (which we might use to test fusion, another thing that broke often):

structure { SOAC 1 }

Nesting was handled by naming the nodes appropriately. For example, a SOAC inside another SOAC would be tallied as a SOAC/SOAC node (as well as two plain SOACs - the outermost and the innermost one). This “quick hack” has proven extremely useful, and perhaps merits a more detailed post one day.
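As an illustration (my own sketch, not lifted from the real test suite), a fusion test where one map is expected to end up nested inside another - and nothing else - might state:

structure { SOAC 2 SOAC/SOAC 1 }

That is, following the counting rule above: two SOAC nodes in total, one of which occurs inside another.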

We also added a notion of “tags”, where programs (and workloads) could be associated with arbitrary tags, which could then be excluded from a specific run with an --exclude option. This was (and is) used for programs that do not work with a specific backend, or for datasets that are too slow. (The sequential Python backend, despite ostensibly being “compiled”, is particularly slow for nontrivial workloads.)
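In a test block this takes the form of a tags stanza; a sketch (the tag name and values are just examples) might look like this:

-- ==
-- tags { no_python }
-- input { 10 } output { 55 }

after which a run such as futhark test --exclude=no_python would skip this program.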

Benchmarking

In early 2016, Futhark was becoming good enough that we started wondering how fast it was - so we added futhark-bench. It accepted the same kinds of files as futhark-test, but instead of merely checking for correctness, it ran each program a bunch of times and reported the resulting performance. In later years, when I have had to benchmark other systems, I have often appreciated that we (relatively) early on decided on a single fully automated way in which Futhark programs should be benchmarked.

Around this time, since we wanted to benchmark programs with large(ish) amounts of data, the MSc student Rasmus Wriedt Larsen designed and implemented a binary data format that we still use (almost) unchanged. Again, settling early on a simple format (which is incidentally also easy to read and write from other languages) has proven enormously convenient. For example, the futhark dataset and futhark datacmp tools can be used for generation, conversion, and comparison of data files. At this point, test programs were also able to indicate that the input (or output) data should be fetched from another file, rather than being embedded in the program itself.
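For example, a test or benchmark program can point at external (possibly binary) data files instead of embedding the values; a sketch (the file names are hypothetical) might look like this:

-- ==
-- input @ big_input.in
-- output @ big_output.out

where the data files can, for instance, be produced with futhark dataset.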

We have a bunch of auxiliary tools for benchmarking purposes, such as futhark benchcmp that can be used for comparing results, and plot-bench.py for visualising results, but I will leave a discussion of those for another post.

Server mode

At this point, futhark-test and futhark-bench were essentially complete. Certainly, small things were added, such as testing multiple entry points in the same program, or the ability to test the presence of specific compiler warnings, and they eventually became subcommands under the names futhark test and futhark bench, but a modern Futhark programmer could probably use them without noticing much difference.

Since the beginning, testing had been based on a simple operational principle: a Futhark program read input on stdin and produced output on stdout. The format had changed from text to binary, and options had been added through which the program would report its runtime, but running a Futhark function still implied starting the program from scratch. It also meant testing was limited to the values that could be represented in the binary data format, which was exclusively arrays of primitive values - no records, tuples, or sum types.

Another problem with this approach was that it could be slow. Initialising the GPU context could easily take several seconds (potentially much more for large programs), which was detrimental for programs that consisted of a large number of functions with (relatively) small individual runtimes, like this reduction microbenchmark. It would be better if a Futhark program could be kept running (and its GPU context intact), and have it run multiple entry points during its lifetime. This was added in early 2021 in the form of Futhark server mode. The most visible feature provided by server mode was futhark literate (now used for our collection of examples), but it also had a significant impact on testing and benchmarking.
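To give a rough feel for it (this is a loose sketch from memory rather than the exact protocol - command names, argument order, and type notation may differ in detail), the compiled program stays running, reads textual commands on stdin, keeps values in named server-side variables, and acknowledges each command before reading the next:

restore input.data x []f32
call main y x
store output.data y
free x y

Here restore loads a value from a data file into the variable x, call runs an entry point and binds its result to y, and store writes y back out - all without re-initialising the GPU context between calls.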

In particular, it was now faster - a lot faster for benchmarks with many datasets, like the microbenchmark linked above. But a more subtle change was that input no longer needed to be expressed as arrays of primitives. Instead, it could be computed with another Futhark function, and passed directly to the entry point being tested, without having to go through a deserialisation/serialisation step. (And even when necessary, the server protocol also came with support for serialising arbitrary Futhark values to byte streams.) This meant we could now write test programs like this:

-- ==
-- entry: doeswork
-- script input { mkdata 100i64 } output { 5050.0f32 }
-- script input { mkdata 10000i64 }
-- script input { mkdata 1000000i64 }

entry mkdata n = (n,map f32.i64 (iota n))

entry doeswork n arr = f32.sum arr + f32.i64 n

Here the doeswork function is the actual function being tested, and mkdata is some arbitrary function that generates data. The test stanza will run the expression mkdata 100i64 to generate the input (the runtime is not counted when benchmarking), which is then passed to doeswork. While the example above is trivial, this is very useful for providing arbitrarily complicated input data (although the result must still be of a simple type). It is also quite useful that we can test or benchmark with large amounts of synthetic data, without having to store it on disk, by generating it dynamically in Futhark.

What about unit tests?

All of the discussion above concerns integration testing, where the full functionality (or close to it) of the compiler is exercised. What about unit testing, where the functionality of individual small components is verified? I am hardly averse to unit testing in general, but for compiler development, I must admit that I find integration tests to provide more bang for the buck. In particular, compiler passes often take quite complicated inputs (entire ASTs), for which the representation is not stable. Keeping unit tests for such functions updated is not impossible, but it is time consuming.

By writing small test programs, and perhaps augmenting with things like the structure tests above, we can obtain many of the benefits of unit testing, with less maintenance overhead. Futhark does have a small collection of unit tests, mainly for verifying the functionality of particularly subtle core components, like graph colouring or index function operations (for the array representation). However, compared to the 2293 test programs, the 2985 source lines of unit tests do look pretty meagre.

We also have a collection of manually written library tests for testing the C and Python APIs, as well as some completely ad hoc tests that are just shell scripts that check various things in the auxiliary tooling. These are not terribly interesting.

Wrapping up

Futhark is a project that even in the best case will never have many resources. Therefore, all our tooling has been built with a sense of minimalism and simplicity, often at the cost of flexibility. Yet I think we have ended up with tools that work pretty well for our purposes, and where we can usually get our work done. The main missing feature is some form of property-based testing, which I hope we can add one day.

Although not originally developed for that purpose, the testing and benchmarking tools are also fairly easy to explain to users. In most other languages, testing tools are libraries that are built in the language itself. This is not possible in Futhark, because it is not a general-purpose language, so we had to build tools that are outside the language itself. This can easily result in tools that are just plain bad, but I think we managed to do alright.