Futhark 0.25.3 released
Futhark 0.25.3 has been released (full changelog here). Apart from a plethora of bug fixes, there are two major additions:
A simplification of the memory representation in the IR, which often results in fewer copies at run-time, as well as lower memory usage. I am working on a separate post about this topic, so I won’t go into more detail here.
An entirely new GPU backend targeting AMD’s HIP API, which is essentially the CUDA API re-implemented for both NVIDIA and AMD GPUs.
You can go download the new version of Futhark from one of these package repositories (or install it yourself), or you can continue reading this post for more on the new backend.
What is HIP?
When NVIDIA released CUDA in 2007, it was a major advance in the state of the art of GPGPU programming. Before CUDA, programmers had to awkwardly express non-graphics computation as if they were graphics operations. Now they could directly work with a fairly straightforward (if low-level) data parallel model, using normal data types and control flow. CUDA quickly became a success, and since it was a proprietary API that worked only on NVIDIA GPUs, this was of course a big problem for other GPU vendors, most notably AMD.
The story of AMD’s attempts to popularise various alternatives to CUDA
is long and tedious, but the most successful attempt was arguably
OpenCL: an open standard for
“accelerator programming” (not just GPUs), which was also adopted and
implemented by NVIDIA. OpenCL typically runs just as fast as CUDA,
but it is much more awkward for a human programmer than the CUDA C++
language implemented by CUDA’s nvcc compiler, mainly because OpenCL
does not allow intermixing of GPU and CPU code in the same compilation
unit. Further, NVIDIA invested heavily in excellent tooling and CUDA
libraries, so even though OpenCL does see significant use, CUDA is
easily (and deservedly) the most popular GPGPU API.
AMD finally decided that there was no realistic chance of supplanting
CUDA, and so instead decided to make their own CUDA at home. Rather
than implementing the CUDA API directly, AMD created HIP, which is
essentially CUDA, except with the word cuda in API function names
replaced by hip (for example, cudaMalloc becomes hipMalloc).
Further, two implementations of the HIP library exist: one that runs
on AMD GPUs, and one that runs on top of CUDA itself (without
overhead, due to the extreme API similarity). There are some small
differences, mostly that HIP does not have CUDA’s confusing and mostly
historical runtime/driver API distinction,
but these are in the lower layers that most programmers do not
directly interact with. Further, AMD also built hipcc, which
implements a single-source multi-device language very similar to CUDA
C++, and which can be compiled to run on either NVIDIA or AMD GPUs.
Finally, AMD released tools such as hipify that can be
used to automatically convert CUDA programs to HIP. Ultimately, HIP is
a GPGPU programming model that is as flexible and convenient as CUDA,
but is portable to both AMD and NVIDIA GPUs.
HIP and Futhark
I think AMD is on the right track with HIP, although time will tell whether it actually ends up making a difference. But what is the importance of all this to Futhark? After all, Futhark has both an OpenCL and a CUDA backend, so Futhark programs are already portable. Why do we want a HIP backend?
The answer is that although OpenCL is not exactly dead, it is somewhat
stagnant when it comes to exposing new facilities. In contrast, since
CUDA is under the full control of NVIDIA, they can immediately expose
fancy new hardware features. While some of them (like ray tracing) are too
application-specific to be useful for the Futhark compiler, things
like warp shuffles and fine-grained synchronisation are very useful
when it comes to generating efficient code. But the thing that pushed
me over the edge was Robin Voetter showing that HIP allowed for the
implementation of a decoupled lookback scan on AMD GPUs - something
that I had never managed to make reliable on
OpenCL, as it depends on somewhat exotic memory consistency guarantees
(we already had it for our CUDA backend). As scan is perhaps the
most important primitive for advanced irregular data parallelism, this
gave me the motivation to put together a HIP backend for
Futhark. And it paid off - most scans are easily 4x as fast with the
HIP backend compared to OpenCL, on the same AMD GPU.
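To make concrete what a scan computes and why irregular parallelism leans on it so heavily, here is a minimal sequential sketch in C++. It is not the decoupled lookback algorithm, and it is not code from the Futhark compiler or its run-time system; the function names inclusive_scan_add and keep_positive are made up for illustration. The second function shows the classic use: a filter, where the scan of a flag array tells each surviving element where to write its result.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference for an inclusive scan (prefix sum) with +:
// out[i] = xs[0] + xs[1] + ... + xs[i].
// Hypothetical helper for illustration only.
std::vector<int> inclusive_scan_add(const std::vector<int>& xs) {
    std::vector<int> out(xs.size());
    int acc = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        acc += xs[i];
        out[i] = acc;
    }
    return out;
}

// A filter expressed via scan: keep only the positive elements.
// Hypothetical helper for illustration only.
std::vector<int> keep_positive(const std::vector<int>& xs) {
    // Step 1: flag each element that should be kept (1) or dropped (0).
    std::vector<int> flags(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) {
        flags[i] = xs[i] > 0 ? 1 : 0;
    }

    // Step 2: scan the flags. For a kept element, offsets[i] - 1 is its
    // position in the output; the last offset is the output length.
    std::vector<int> offsets = inclusive_scan_add(flags);
    std::size_t n =
        offsets.empty() ? 0 : static_cast<std::size_t>(offsets.back());

    // Step 3: each kept element writes itself to its computed position.
    std::vector<int> out(n);
    for (std::size_t i = 0; i < xs.size(); ++i) {
        if (flags[i] == 1) {
            out[static_cast<std::size_t>(offsets[i] - 1)] = xs[i];
        }
    }
    return out;
}
```

On a GPU, steps 1 and 3 are independent per element and so trivially parallel. The scan in step 2 is the only part where elements need to communicate, which is why the speed of scan tends to dominate this kind of code.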
While I was fiddling with GPU code generation, I also took the time to refactor the common parts of the different GPU backend run-time systems into a common utility library. The idea is to make it much easier to support new GPU APIs, and indeed the run-time parts of the HIP backend constitute only 900 lines of C, most of which is HIP-specific boilerplate initialisation and configuration code. Now, this is certainly a best-case situation, as HIP is conceptually very similar to both CUDA and OpenCL, but I hope it can still be useful for supporting e.g. Vulkan or WebGPU.
Although some bugs undoubtedly remain undiscovered and unfixed, the
HIP backend is fully functional, tested, and used in my own work.
Using it is as simple as typing futhark hip. When compiling
executables, it is currently hard-coded to use only the AMD
implementation of HIP, but if for some reason users on NVIDIA GPUs
would prefer to compile to HIP instead of CUDA, it would be
straightforward to add a command line option for doing so.
While I’m often critical of AMD’s software efforts, I must give credit when credit is due: HIP is good. In fact, it is really good. The deviations from CUDA (in obscure corners) tend to be cleanups of historical baggage. I’ve liked AMD ever since they started open sourcing their drivers, and it is nice to see that their software is improving not just ethically, but also technically. I hope HIP will increase competition in the GPGPU space.