Futhark 0.25.3 released

Posted on August 30, 2023 by Troels Henriksen

Futhark 0.25.3 has been released (full changelog here). Apart from a plethora of bug fixes, there are two major additions:

A simplification of the memory representation in the IR, which often results in fewer copies at run time, as well as lower memory usage. I am working on a separate post about this topic, so I won’t go into more detail here.
An entirely new GPU backend targeting AMD’s HIP API, which is essentially the CUDA API re-implemented for both NVIDIA and AMD GPUs.

You can go download the new version of Futhark from one of these package repositories (or install it yourself), or you can continue reading this post for more on the new backend.

What is HIP?

When NVIDIA released CUDA in 2007, it was a major advance in the state of the art of GPGPU programming. Before CUDA, programmers had to awkwardly express non-graphics computation as if it were graphics operations. Now they could directly work with a fairly straightforward (if low-level) data parallel model, using normal data types and control flow. CUDA quickly became a success, and since it was a proprietary API that worked only on NVIDIA GPUs, this was of course a big problem for other GPU vendors, most notably AMD.

The story of AMD’s attempts to popularise various alternatives to CUDA is long and tedious, but the most successful attempt was arguably OpenCL: an open standard for “accelerator programming” (not just GPUs), which was also adopted and implemented by NVIDIA. OpenCL typically runs just as fast as CUDA, but it is much more awkward for a human programmer than the CUDA C++ language implemented by CUDA’s nvcc compiler, mainly because OpenCL does not allow intermixing of GPU and CPU code in the same compilation unit. Further, NVIDIA invested heavily in excellent tooling and CUDA libraries, so even though OpenCL does see significant use, CUDA is easily (and deservedly) the most popular GPGPU API.

AMD finally decided that there was no realistic chance of supplanting CUDA, and so instead decided to make their own CUDA at home. Instead of straight up implementing the CUDA API, they created HIP, which is essentially CUDA, except with the word cuda in API functions replaced by hip. Further, two implementations of the HIP library exist: one that runs on AMD GPUs, and one that runs on top of CUDA itself (without overhead, due to the extreme API similarity). There are some small differences, mostly that HIP does not have CUDA’s confusing and mostly historical runtime/driver API distinction, but these are in the lower layers that most programmers do not directly interact with. Further, AMD also built hipcc, which implements a single-source multi-device language very similar to CUDA C++, and which can be compiled to run on either NVIDIA or AMD GPUs. Finally, AMD released tools such as hipify that can be used to automatically convert CUDA programs to HIP. Ultimately, HIP is a GPGPU programming model that is as flexible and convenient as CUDA, but is portable to both AMD and NVIDIA GPUs.
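
To make the naming correspondence concrete, here is a toy Python sketch that rewrites a few CUDA runtime calls into their HIP equivalents by swapping the cuda prefix for hip. The helper rename_cuda_to_hip and the C snippet it processes are made up for illustration. This is not how the hipify tools work internally, and a plain prefix swap will not handle every CUDA program.

```python
import re

# Toy illustration only: match the prefix "cuda" when it starts an identifier
# and is followed by an uppercase letter (cudaMalloc, cudaMemcpy, ...).
# This is NOT the implementation of AMD's hipify tools; it only demonstrates
# the naming pattern shared by the two APIs.
_CUDA_PREFIX = re.compile(r"\bcuda(?=[A-Z])")


def rename_cuda_to_hip(source: str) -> str:
    """Replace the 'cuda' prefix of API identifiers with 'hip'."""
    return _CUDA_PREFIX.sub("hip", source)


# A small, hypothetical host-side fragment using the CUDA runtime API.
cuda_snippet = """\
float *d_xs;
cudaMalloc(&d_xs, n * sizeof(float));
cudaMemcpy(d_xs, h_xs, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_xs);
"""

hip_snippet = rename_cuda_to_hip(cuda_snippet)
print(hip_snippet)
```

The functions that come out (hipMalloc, hipMemcpy, hipDeviceSynchronize, hipFree) and the constant hipMemcpyHostToDevice are all real HIP names, which is the point: for the common runtime calls, the two APIs differ in little more than the prefix.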

HIP and Futhark

I think AMD is on the right track with HIP, although time will tell whether it actually ends up making a difference. But what is the importance of all this to Futhark? After all, Futhark has both an OpenCL and a CUDA backend, so Futhark programs are already portable. Why do we want a HIP backend?

The answer is that although OpenCL is not exactly dead, it is somewhat stagnant when it comes to exposing new facilities. In contrast, since CUDA is under the full control of NVIDIA, they can immediately expose fancy new hardware features. While some of them (like ray tracing) are too application-specific to be useful for the Futhark compiler, things like warp shuffles and fine-grained synchronisation are very useful when it comes to generating efficient code. But the thing that pushed me over the edge was Robin Voetter showing that HIP allowed for the implementation of a decoupled lookback scan on AMD GPUs - something that I had never managed to make reliable on OpenCL, as it depends on somewhat exotic memory consistency guarantees (we already had it for our CUDA backend). As scan is perhaps the most important primitive for advanced irregular data parallelism, this gave me the motivation to put together a HIP backend for Futhark. And it paid off - most scans are easily 4x as fast with the HIP backend compared to OpenCL, on the same AMD GPU.
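
For readers unfamiliar with the technique, the following is a sequential Python model of the idea behind a decoupled lookback scan. It is only a sketch under simplifying assumptions and says nothing about how Futhark's generated GPU code looks. On a GPU the blocks run concurrently and a block waits until its predecessor has published at least an aggregate, which is where the memory consistency guarantees come in. Here that is replaced by two sequential phases and an arbitrary completion order. The function name lookback_scan and its parameters are made up for illustration.

```python
from itertools import accumulate


def lookback_scan(xs, block_size, order=None):
    """Inclusive prefix sum via a sequential model of decoupled look-back."""
    blocks = [xs[i:i + block_size] for i in range(0, len(xs), block_size)]
    n = len(blocks)

    # Phase 1: every block publishes the sum of its own elements.
    aggregate = [sum(block) for block in blocks]

    # inclusive[i] is the sum of blocks 0..i, or None while still unknown.
    inclusive = [None] * n
    out = [None] * n

    # Phase 2: blocks finish in an arbitrary order (hypothetical scheduling).
    for i in (order if order is not None else range(n)):
        exclusive = 0
        j = i - 1
        # Look back over predecessors: add their aggregates until one with a
        # known inclusive prefix is found, then stop early.
        while j >= 0:
            if inclusive[j] is not None:
                exclusive += inclusive[j]
                break
            exclusive += aggregate[j]
            j -= 1
        # Publish this block's inclusive prefix for later blocks to use.
        inclusive[i] = exclusive + aggregate[i]
        # Local scan of the block, offset by the exclusive prefix.
        out[i] = [exclusive + v for v in accumulate(blocks[i])]

    return [v for block in out for v in block]


xs = list(range(1, 11))
# Block size 3 gives 4 blocks; let them finish out of order.
result = lookback_scan(xs, 3, order=[3, 1, 0, 2])
assert result == list(accumulate(xs))
```

The attraction of the approach is that each element is read and written once, instead of the multiple passes over the data that a classical multi-stage GPU scan needs.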

While I was fiddling with GPU code generation, I also took the time to refactor the common parts of the different GPU backend run-time systems into a common utility library. The idea is to make it much easier to support new GPU APIs, and indeed the run-time parts of the HIP backend constitute only 900 lines of C, most of which is HIP-specific boilerplate initialisation and configuration code. Now this is certainly a best case situation, as HIP is conceptually very similar to both CUDA and OpenCL, but I hope it can still be useful for supporting e.g. Vulkan or WebGPU.

Although some bugs undoubtedly remain undiscovered and unfixed, the HIP backend is fully functional, tested, and used in my own work. Using it is as simple as typing futhark hip. When compiling executables, it is currently hard-coded to use only the AMD implementation of HIP, but if for some reason users would prefer to use HIP rather than CUDA on NVIDIA GPUs, it would be straightforward to add a command line option for doing so.

While I’m often critical of AMD’s software efforts, I must give credit where credit is due: HIP is good. In fact, it is really good. The deviations from CUDA (in obscure corners) tend to be cleanups of historical baggage. I’ve liked AMD ever since they started open sourcing their drivers, and it is nice to see that their software is improving not just ethically, but also technically. I hope HIP will increase competition in the GPGPU space.