Futhark 0.21.9: now with language server and adaptive benchmarking
I no longer write blog posts for every single release. This is because most releases simply contain bugfixes or optimisations that might matter to the affected users, but don’t make for very interesting reading to the uninitiated (see for yourself). Because let’s be realistic - most people reading these blog posts are not Futhark programmers, in the same sense that they also don’t drive a Ferrari.
So what makes this release merit a blog post? It has a handful of features that I not only find exciting, but that we also obtained with little effort on my part. See, most of Futhark’s development occurs at the University of Copenhagen, where students are required to undertake a project at the end of their undergraduate and master’s programmes. Doing such projects was among the most memorable experiences of my own time as a student, and I want new generations to have the same experience. Going beyond what I accomplished myself, I also want the students to have a chance to contribute to a real project (if an obscure one), with actual users and people who truly care about the quality of their work, because they will make actual use of it.
Or to take a more cynical view, we exploit a source of forced labour provided to us by the university. This is by itself not new - Futhark development has long made use of students both as labour and as human test subjects. The unusual part this time around is that two students managed to complete a meaningful subset of their work early enough for it to be included in a compiler release, before they had to finish their project. This will permit a kind of real-world testing that is not normally feasible for these projects. I think we got lucky by a combination of motivated students and focused and constrained projects. It’s probably not something we can reliably repeat in the future, but it certainly is a nice experience.
On to the features!
Holes
The first new feature is holes, developed by Sun Haoran as part of his master’s thesis. Holes are a feature that I believe originally comes from dependently typed languages such as Idris, although Futhark’s version takes more inspiration from Flix.
Essentially, a hole is written as ??? and represents a placeholder expression of unknown type. The type checker will infer the type of the hole and tell you what it is through a warning. In a language with a sufficiently fancy type system and tooling, this can even be used to suggest expressions that can fill the hole. Futhark is not quite at that level, so here holes function more like undefined in Haskell. This allows a style of type-driven programming where you define the types of the functions in your design, but don’t actually give them definitions until later, when the design is done. First you build the skeleton, then you put on the meat. I use this approach often in Haskell. It’s a simple feature, but it can be quite pleasant to use.
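To make this concrete, here is a small made-up example of the skeleton-first style; only dotprod is finished, while normalise is left as a hole:

    def dotprod [n] (xs: [n]f64) (ys: [n]f64) : f64 =
      f64.sum (map2 (*) xs ys)

    -- Not written yet; the type checker warns that this hole
    -- has type [n]f64.
    def normalise [n] (xs: [n]f64) : [n]f64 =
      ???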
You can compile and run a program with holes, although it will fail at run time if the hole is executed. This can be useful for putting holes in branches that you haven’t yet worked out how to handle.
The Futhark language server
Typed holes are not really what Sun Haoran’s project is about; they were more of a warmup exercise. His real work is the creation of a language server, which can be used to power IDE-like features in various editors. VS Code, the originator of the Language Server Protocol (LSP), is likely still the most prominent user, but language servers also work well in an elegant editor for a more civilised age. Language servers can become incredibly advanced, but the one for Futhark is still fairly basic. It currently supports “hover information”, such as showing the type of the variable under the cursor, as well as jumping to its definition. The foundations are sound, so we can gradually grow the feature set over time. The server comes with the compiler and can be run as futhark lsp, but unless you enjoy typing the LSP wire format by hand, you are better off using the VS Code language extension, or perhaps a more civilised tool.
The major deficiency in the language server, which will not be easy to fix, is that it relies on a compiler frontend that is not very error tolerant. If the Futhark program has a type or syntax error, then the compiler will be unable to produce any partial information about the program. The language server has some workarounds for this, such as caching the information available when the program was last well-typed, but a proper solution requires significant re-engineering of the type checker.
Adaptive benchmarking
While the language server is likely the most interesting feature to most Futhark users, I often spend more time measuring Futhark code than writing it. Therefore I am quite excited about Aleksander Junge’s work on improving futhark bench, the built-in benchmarking tool.
In general, benchmarking is hard. You need to take into account the specifics of the program and the machine on which it is run, ensure that nothing else is running concurrently (unless that is what you want), make sure that your testing workloads are representative of real work, and so on. The goal of futhark bench is not to solve these very tricky and open-ended problems, but rather to provide a simple tool that will in most cases do the right thing.
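For reference, a benchmark is just an ordinary Futhark program with a test block in comments, following the documented test syntax. This little example is made up, but the format is the real one:

    -- Sum of squares.
    -- ==
    -- input { [1.0f64, 2.0f64, 3.0f64] }
    -- output { 14.0f64 }
    -- random input { [1000000]f64 }

    def main (xs: []f64) : f64 =
      f64.sum (map (\x -> x*x) xs)

Running futhark bench on such a file compiles the program and measures each input block as a separate workload.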
Previously, our benchmarking methodology was as follows: For each user-provided workload, first do a single unmeasured warmup run, then do a fixed number of runs (10 by default), collect and store their individual runtimes for possible later processing, and report their arithmetic mean to the user.
Before we discuss the limitations of this methodology, it’s important to keep in mind that there are several equally valid targets for benchmarking:
- Performance of the first run, perhaps even including startup time.
- Performance of the fastest single run.
- Performance of the slowest single run.
- Sustained average performance over many runs.
All of the above can be valid in different contexts. For Futhark, we believe that sustained performance is what matters most. Experience has taught us that most Futhark programs are executed in tight loops, so futhark bench is built to measure that kind of usage. Some languages and users might have other needs.
The problem with the fixed-number-of-runs approach is that there is no guarantee that results from 10 runs will be predictive of future performance. For workloads that finish quickly (e.g. in a millisecond), we observed that performance would be much better if we asked futhark bench to perform hundreds or thousands of runs. Conversely, we sometimes saw that performance would slowly decrease until it reached some stable plateau. Most vexing, some particularly bizarre benchmarks showed bistable behaviour, where the observed performance would oscillate between different plateaus of stable (and sometimes quite different) performance.
Here is an example of such a situation (in this case created artificially by Aleksander Junge). The X axis contains the run number, and the Y axis the runtime in microseconds for that specific run:
Such behaviour can be explained: warmup periods, background tasks, scheduling decisions, overheating, and other complex hardware behaviour. But for such complicated problems, we academics are taught a powerful phrase: “outside the scope of this paper”. Identifying, let alone correcting for, the sources of noise is well outside the scope of futhark bench. The tool should only put in the effort to ensure that such patterns, if they occur, will be part of the collected measurements, so they can be analysed by something or someone else. Concretely, futhark bench should keep an eye on the results collected so far, and if they look sufficiently wacky (not a formal statistical term), automatically decide to collect more measurements until there is some confidence that the dataset fully describes the wackiness.
Therefore, futhark bench now uses the following two-phase methodology:
1. The initial phase performs ten runs, or repeated runs for at least half a second, whichever takes longer. If the resulting measurements are sufficiently statistically robust (determined using standard deviation and autocorrelation), the results are produced and the second phase is not entered. Otherwise, the results are discarded and the second phase is entered.

2. The convergence phase keeps performing runs until measurements of sufficient statistical quality (again based on standard deviation and autocorrelation) have been obtained, which we call convergence. A sketch of what such a check might look like follows below.
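To give a flavour of what such a robustness check might involve, here is a purely illustrative sketch, written in Futhark for fun (futhark bench itself is part of the compiler, which is written in Haskell, and its actual criteria and thresholds differ):

    def mean [n] (xs: [n]f64) : f64 =
      f64.sum xs / f64.i64 n

    def stddev [n] (xs: [n]f64) : f64 =
      let m = mean xs
      in f64.sqrt (f64.sum (map (\x -> (x-m)*(x-m)) xs) / f64.i64 n)

    -- Lag-1 autocorrelation: a high value means neighbouring runs are
    -- correlated, suggesting drift or oscillation rather than noise.
    def autocorr1 [n] (xs: [n]f64) : f64 =
      let m = mean xs
      let num = f64.sum (map (\i -> (xs[i]-m) * (xs[i+1]-m)) (iota (n-1)))
      let den = f64.sum (map (\x -> (x-m)*(x-m)) xs)
      in num / den

    -- Made-up thresholds: accept the measurements when the relative
    -- standard deviation and the lag-1 autocorrelation are both small.
    def converged [n] (runtimes: [n]f64) : bool =
      stddev runtimes / mean runtimes < 0.05 &&
      f64.abs (autocorr1 runtimes) < 0.1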
The reason we throw away the measurements from the initial phase when entering the convergence phase is that we have noted that these measurements tend to statistically behave like outliers, likely because they correspond to measurements from some kind of “warmup period”. Keeping them around just means that phase 2 takes longer to converge. If it turns out they are not outliers, then they will occur again during the convergence phase, and so discarding them should have done no harm. (Also note our focus on sustained performance - if we assume that the program will run for arbitrarily long, then it is acceptable to throw away any prefix of the measurements.)
In principle, the safest approach would be to continue measuring for a very long time - say, an hour. But this would make futhark bench supremely annoying to use. With the new approach, running the entire benchmark suite (about 50 programs, most with multiple workloads) on an AMD MI100 GPU takes about 16 minutes. Most programs don’t even enter the convergence phase, and those that do tend to converge quite quickly.
Ultimately this approach is a heuristic that balances accuracy and practicality, but Aleksander Junge has done a pretty decent investigation of statistical methods in order to develop it. We have experimentally determined that the results obtained are close to the “perfect” results that we obtain by just doing a huge number of runs.
More projects
The projects discussed above are not the only Futhark projects currently being worked on by our students. Walter Restelli-Nielsen and Amar Topalovic are rewriting our fusion engine, long one of the gnarliest parts of the compiler, to more faithfully resemble the graph algorithm that it supposedly implements. This not only fixes some bugs, but also lays a foundation for even more aggressive fusion in the future.
Another project, by Louis Marott Normann, Kristoffer August Kortbæk, William Pema Norbu Holmes Malling, and Oliver Bak Kjersgaard Petersen, is developing an ISPC backend for Futhark. This is a very complex undertaking, partly because the domain is inherently tricky, and partly because ISPC was not intended as a target for code generation (the students keep finding ISPC bugs). Performance on some benchmarks is promising, but I think it will take a lot of work before Futhark can reliably generate efficient SIMD code for most programs.
While these last two projects are not yet developed enough to be part of this compiler release, I hope and expect that they will be part of a future one. And for sure, their work will be put on display in the halls of Valhalla.