I just added half-precision floats to Futhark, as the type
This wasn’t particularly hard (perhaps a day of work), but it was
annoying for fairly shallow technical reasons, so as a bit of
catharsis, here’s a blog post about the challenges I encountered.
First of all, why do we even want 16-bit floats? Their precision is horrible and the largest representable value is a mere 65504. Apparently 16-bit floats were used by various exotic processors in the early 80s, but the modern lineage, as eventually standardised by IEEE 754-2008, hails from graphics programming. This is because half-precision floats can be useful for expressing light intensities, as they are accurate enough, while allowing much larger dynamic range than a 16-bit fixed point number. Initially, half-precision was only used as a storage format, but eventually GPUs grew support for operating directly on half-precision floats; often with twice the arithmetic throughput. Since they only take up half as much space in memory as single-precision floats (and hence also use half the memory bandwidth), they are in theory “twice as fast” as single-precision floats.
Half-precision floats have also become increasingly popular for use in machine learning applications, as it appears neural networks are resistant to numerical problems (presumably they just train around them). But this is where things get interesting: there are actually (at least) two half-precision float formats. All take up 16 bits in memory, but differ in how they allocate those bits to significand and exponent. The one we support in Futhark is the binary16 format from IEEE 754-2008, but Google has also developed the bfloat16 format for use in AI accelerators. ARM apparently also supports a nonstandard variation of binary16, but I’m hoping I will never have to learn the details.
So we want to support half-precision floats because some applications can make use of them, either because they take up less space or because computation on them is actually faster.
Half-precision floats are standardised in IEEE 754-2008, but they are
not a universally supported type in common programming languages.
In particular, ISO C does not support them. Other languages have
unusual restrictions where
half is supported as a “storage type”
(meaning you can put them in memory), but you need to convert them to
some other type (e.g.
float) if you actually want to do arithmetic
on them. This is not how we want to do it in Futhark. The
should be fully supported like any other primitive type, and on all
systems. When necessary, the compiler must generate code to paper
over lack of hardware support.
Most of Futhark’s backends generate C code - either plain CPU C,
OpenCL C, or CUDA C. Let’s start with plain C, which does not support
half type. One option would be representing
f16 as a 16-bit
integer and then implementing every single arithmetic operation
manually. People have done so, but
such code runs extremely slowly. Instead, we simply
typedef float f16 and actually perform the
f16 operations in single precision.
However, we don’t want these simulated floats to take up 32 bits per
element in memory, so we distinguish between the scalar type used
for operations, which is
float, and the storage type used for
arrays, which is
uint16_t. When reading an
f16 from an array, we
must use a function to convert from the storage type to the scalar
type - fortunately, the library linked above contains such a function
and its inverse. This means that
f16 operations might produce
different results depending on whether they are emulated or not, as
emulation only performs the rounding at the end, while true
calculations might also round intermediate results. I’m not
sufficiently experienced with numerical analysis, or the guarantees of
IEEE 754, to say whether this is a real problem.
We also have a problem with Futhark’s C
Normally an entry point that returns a scalar will be exposed as a C
function that produces a value of the obvious type (
i32 and so on), but C does not have a
half type. Again we fall
back to the notion of “storage type” and return
f16 results as
uint16_t. Life would be so much easier if
half was universally
Let’s move on to the OpenCL backend. Surely life is easier there!
No, it never is. OpenCL does actually define a
half type, but
only as a storage format, meaning you can have pointers to them, but
not perform any arithmetic, unless the platform in question supports
extension (and you enable it). OpenCL platforms are loaded
dynamically, followed by run-time compilation of kernels, so the
Futhark compiler does not know at compile-time whether the target platform
supports half-precision. Therefore we generate kernel code with
to detect whether the platform supports
cl_khr_fp16, and fall back
float-based emulation otherwise. Remarkably, OpenCL does
provide builtin functions for efficiently translating between
single-precision and half-precision floats stored in
even for those platforms that don’t have
cl_khr_fp16. This lets us
load half-precision floats into single-precision scalars at quite high
speed. I observed about an order of magnitude difference compared to
the hand-written conversion function. This is quite important, since
for some reason, NVIDIA’s OpenCL implementation does not support
cl_khr_fp16, despite their GPUs being fully capable.
As a delightful quirk, the OpenCL emulator
Oclgrind claims to support
cl_khr_fp16, but will fail to actually perform any operations on
half values. Futhark users will never use Oclgrind, but it is an
absolutely indispensable tool for hacking on the compiler, and is used
in our regression test suite as well. The solution is an ad-hoc
check to always emulate half-precision on
Okay, time for CUDA. NVIDIA has been marketing half-precision hard, so surely everything will just work out, right?
So, CUDA does indeed support half-precision floats on devices that are
Compute Capability 6.0 or newer. This can be checked with an
#ifdef. However, for some strange reason, you have to include a
special header file,
cuda_fp16.h, to actually get access to the
half type and its operations. Glancing at the header file, it looks
half is actually implemented as a C++ class with inline PTX
(CUDA assembly) to perform the computations using the natively
supported half-precision instructions. This is quite baroque. Why
half type just implemented in the compiler, like it is in
I suspect the reason is that CUDA is mostly a “single source” programming model where the same program contains both CPU and GPU code, where the CPU code is ultimately compiled with the system compiler - which does not support half-precision floats, and so needs compatibility shims. In contrast, OpenCL C is specifically for “device” code (read: GPU code), and the “host” (CPU) is completely out of its purview. Futhark does not use CUDA in a “single source” manner, but instead uses NVRTC to do run-time compilation of CUDA kernels, very similar to how OpenCL works.
The need to include
cuda_fp16.h is not just a compiler
implementation curiosity - it turns out to present the greatest
annoyance in this entire cavalcade of accidental complexity.
Apparently when you run-time compile a CUDA kernel with NVRTC, the
default search path does not include the CUDA include directory, so
this header cannot be found. Whoever calls the NVRTC API must
explicitly provide the directory containing this header. This is
bad. I’ve long sought to ensure Futhark didn’t contribute to the
growing accidental complexity
disaster, with new
configuration files and environment variables being my main aversion.
But now it seems we have no choice. Futhark must be able to figure
out, at run-time, where CUDA is installed, and there is no standard
for this. Sure, it’s often in
/usr/local/cuda, but it’s fairly
common to put it somewhere in
/opt as well, especially on
supercomputer systems. And there is no standard environment
variable pointing at the CUDA directory! Other GPU applications
respect any or all or
to my great regret, now Futhark does as
This is a tragedy and a shame. There is no good reason that is has to be this way. I know that this will result in things not working down the line, just because a user didn’t set an invisible global variable correctly. I still hope that I just missed a subtle detail somewhere, and that this will eventually go away like a bad dream.
Half-precision floats are the worst IEEE floats in all ways except
speed. So are they faster in Futhark? Sort of. On an A100 GPU, the
f16.sum can sum one billion half-precision floats in
f32.sum takes 2.9ms to sum one billion
single-precision floats. That’s a speedup of 1.38x, which is much
less than the 2x we would theoretically expect. So what’s going
wrong? Two things:
Adding two half-precision floats on their own is probably not faster than adding two single-precision floats. CUDA’s half-precision API strongly implies one should consider using the
half2type, which packs two half-precision numbers in 32 bits, and allows “vectorised” operations on both in a single instruction. As far as I understand current hardware, this is the way half-precision floats can achieve high arithmetic throughput. I don’t know if the CUDA compiler will optimise multiple
half2s on its own, but I doubt it - especially for a reduction.
The analysis above is valid, but it’s not actually what is happening here. This summation is completely bandwidth-bound, so the only thing that matters is how efficiently we access memory. In
f16.sum, each thread reads a 16-bit value at a time. While the Futhark compiler arranges things so that the accesses are fully coalesced, the GPU’s memory architecture is not really designed for individual threads issuing such small reads - perhaps each thread will actually read (at least) 32 bits, then throw half of it away. In any case, the result is that the memory bus is not fully utilised.
A more efficient half-precision reduction would have each thread read
multiple packed values with a single transaction, such as by reading a
half2. This is a trick we’ll have to teach the compiler at
some point. This is also not a new concern - if you write a program
that sums 8-bit or 16-bit integers, you are also not fully utilising
the memory bus, even if in absolute runtime, it’s still faster than
summing the same number of 32-bit integers.
How do we verify that this analysis is correct? We can write a
program that interprets an array of
u32 values as packed
map (\x -> f16.from_bits(u16.u32(x>>
This can sum the equivalent of a billion
f16s (that is, half a
u32s) in a mere 1.5ms - a 1.93x speedup over
Unless you know what you are doing, you probably should not use
in your code. Their numerical properties are absolutely going to bite
you. Interoperability is awkward and the performance isn’t even that
great yet. But from a language perspective they are now fully
supported, and hopefully we’ll make them run even faster in a not too
distant future. But I doubt we’ll add more bits, so the numerics are
always going to be dubious.