Supporting half-precision floats is really annoying
I just added half-precision floats to Futhark, as the type f16.
This wasn’t particularly hard (perhaps a day of work), but it was
annoying for fairly shallow technical reasons, so as a bit of
catharsis, here’s a blog post about the challenges I encountered.
Why
First of all, why do we even want 16-bit floats? Their precision is horrible and the largest representable value is a mere 65504. Apparently 16-bit floats were used by various exotic processors in the early 80s, but the modern lineage, as eventually standardised by IEEE 754-2008, hails from graphics programming. This is because half-precision floats can be useful for expressing light intensities, as they are accurate enough, while allowing much larger dynamic range than a 16-bit fixed point number. Initially, half-precision was only used as a storage format, but eventually GPUs grew support for operating directly on half-precision floats; often with twice the arithmetic throughput. Since they only take up half as much space in memory as single-precision floats (and hence also use half the memory bandwidth), they are in theory “twice as fast” as single-precision floats.
Half-precision floats have also become increasingly popular for use in machine learning applications, as it appears neural networks are resistant to numerical problems (presumably they just train around them). But this is where things get interesting: there are actually (at least) two half-precision float formats. All take up 16 bits in memory, but differ in how they allocate those bits to significand and exponent. The one we support in Futhark is the binary16 format from IEEE 754-2008, but Google has also developed the bfloat16 format for use in AI accelerators. ARM apparently also supports a nonstandard variation of binary16, but I’m hoping I will never have to learn the details.
So we want to support half-precision floats because some applications can make use of them, either because they take up less space or because computation on them is actually faster.
How
Half-precision floats are standardised in IEEE 754-2008, but they are
not a universally supported type in common programming languages.
In particular, ISO C does not support them. Other languages have
unusual restrictions where half
is supported as a “storage type”
(meaning you can put them in memory), but you need to convert them to
some other type (e.g. float) if you actually want to do arithmetic
on them. This is not how we want to do it in Futhark. The f16
type
should be fully supported like any other primitive type, and on all
systems. When necessary, the compiler must generate code to paper
over lack of hardware support.
C
Most of Futhark’s backends generate C code - either plain CPU C,
OpenCL C, or CUDA C. Let’s start with plain C, which does not support
a half
type. One option would be representing f16
as a 16-bit
integer and then implementing every single arithmetic operation
manually. People have done so, but
such code runs extremely slowly. Instead, we simply typedef float f16
and actually perform the f16
operations in single precision.
However, we don’t want these simulated floats to take up 32 bits per
element in memory, so we distinguish between the scalar type used
for operations, which is float
, and the storage type used for
arrays, which is uint16_t
. When reading an f16
from an array, we
must use a function to convert from the storage type to the scalar
type - fortunately, the library linked above contains such a function
and its inverse. This means that f16
operations might produce
different results depending on whether they are emulated or not, as
emulation only performs the rounding at the end, while true f16
calculations might also round intermediate results. I’m not
sufficiently experienced with numerical analysis, or the guarantees of
IEEE 754, to say whether this is a real problem.
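Concretely, the scheme looks roughly like this (a sketch with made-up helper names, standing in for the conversion routines from the library linked above):
#include <stdint.h>
#include <string.h>

/* Scalar type: arithmetic on "f16" values is really carried out in
   single precision when there is no hardware half type. */
typedef float f16;

/* Storage type: arrays of f16 still take only 16 bits per element. */
typedef uint16_t f16_storage;

/* Widen a stored binary16 value (normals, subnormals, infinities,
   NaNs) to float.  The inverse direction, which rounds a float down
   to 16 bits on stores, is omitted here. */
static float fp16_to_fp32(uint16_t h) {
  uint32_t sign = (uint32_t)(h >> 15) << 31;
  uint32_t exp  = (h >> 10) & 0x1F;
  uint32_t frac = h & 0x3FF;
  uint32_t bits;
  if (exp == 0x1F) {                        /* infinity or NaN */
    bits = sign | 0x7F800000u | (frac << 13);
  } else if (exp != 0) {                    /* normal number */
    bits = sign | ((exp + 112) << 23) | (frac << 13);
  } else if (frac == 0) {                   /* signed zero */
    bits = sign;
  } else {                                  /* subnormal: renormalise */
    exp = 113;
    while ((frac & 0x400) == 0) { frac <<= 1; exp--; }
    bits = sign | (exp << 23) | ((frac & 0x3FF) << 13);
  }
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* Reading element i of an f16 array: load the 16-bit storage type,
   then convert to the scalar type before doing any arithmetic. */
static inline f16 read_f16(const f16_storage *arr, int64_t i) {
  return fp16_to_fp32(arr[i]);
}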
We also have a problem with Futhark’s C API.
Normally an entry point that returns a scalar will be exposed as a C
function that produces a value of the obvious type (int32_t
for
i32
and so on), but C does not have a half
type. Again we fall
back to the notion of “storage type” and return f16
results as
uint16_t
. Life would be so much easier if half
was universally
supported.
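For a hypothetical entry point that returns an f16, the generated C API therefore ends up with a signature along these lines (sketched from the description above; the exact shape depends on the program):
#include <stdint.h>

struct futhark_context;  /* from the generated header */

/* For an entry point such as
     entry f (x: f32) : f16 = f16.f32 x
   the result cannot be returned as a C half, so it comes back as the
   16-bit storage type instead: */
int futhark_entry_f(struct futhark_context *ctx,
                    uint16_t *out0,   /* the f16 result, as raw binary16 bits */
                    const float in0);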
OpenCL
Let’s move on to the OpenCL backend. Surely life is easier there!
No, it never is. OpenCL does actually define a half
type, but
only as a storage format, meaning you can have pointers to them, but
not perform any arithmetic, unless the platform in question supports
the
cl_khr_fp16
extension (and you enable it). OpenCL platforms are loaded
dynamically, followed by run-time compilation of kernels, so the
Futhark compiler does not know at compile-time whether the target platform
supports half-precision. Therefore we generate kernel code with
plenty of
#ifdefs
to detect whether the platform supports cl_khr_fp16
, and fall back
to float
-based emulation otherwise. Remarkably, OpenCL does
provide builtin functions for efficiently translating between
single-precision and half-precision floats stored in
memory,
even for those platforms that don’t have cl_khr_fp16
. This lets us
load half-precision floats into single-precision scalars at quite high
speed. I observed about an order of magnitude difference compared to
the hand-written conversion function. This is quite important, since
for some reason, NVIDIA’s OpenCL implementation does not support
cl_khr_fp16
, despite their GPUs being fully capable.
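In practice the generated kernel code starts with a preamble along these lines (a simplified sketch, not the literal code the compiler emits):
// Preamble for OpenCL kernels that use f16.
#ifdef cl_khr_fp16
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
typedef half f16;                      // native scalar half type
#define load_f16(p, i)     ((p)[i])
#define store_f16(p, i, x) ((p)[i] = (x))
#else
typedef float f16;                     // emulate: compute in single precision
// vload_half/vstore_half are core builtins available even without
// cl_khr_fp16; they convert between half in memory and float scalars.
#define load_f16(p, i)     vload_half((i), (p))
#define store_f16(p, i, x) vstore_half((x), (i), (p))
#endif
Kernel bodies then touch f16 arrays only through load_f16 and store_f16, so the same code works in both modes.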
As a delightful quirk, the OpenCL emulator
Oclgrind claims to support
cl_khr_fp16
, but will fail to actually perform any operations on
half
values. Futhark users will never use Oclgrind, but it is an
absolutely indispensable tool for hacking on the compiler, and is used
in our regression test suite as well. The solution is an ad-hoc
check to always emulate half-precision on
Oclgrind.
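The check itself is nothing sophisticated - just host-side string matching on the platform name, along these lines (a sketch, not the actual runtime code):
#include <stdbool.h>
#include <string.h>
#include <CL/cl.h>

// Decide whether to trust cl_khr_fp16 on this platform.  Oclgrind
// advertises the extension but cannot actually execute half
// arithmetic, so force emulation there.
static bool use_native_fp16(cl_platform_id platform, bool has_khr_fp16) {
  char name[256];
  clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof name, name, NULL);
  if (strstr(name, "Oclgrind") != NULL) {
    return false;
  }
  return has_khr_fp16;
}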
CUDA
Okay, time for CUDA. NVIDIA has been marketing half-precision hard, so surely everything will just work out, right?
Sure.
So, CUDA does indeed support half-precision floats on devices that are
Compute Capability 6.0 or newer. This can be checked with an
#ifdef
. However, for some strange reason, you have to include a
special header file, cuda_fp16.h
, to actually get access to the
half
type and its operations. Glancing at the header file, it looks
like half
is actually implemented as a C++ class with inline PTX
(CUDA assembly) to perform the computations using the natively
supported half-precision instructions. This is quite baroque. Why
isn’t the half
type just implemented in the compiler, like it is in
OpenCL?
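So the generated CUDA code ends up with a preamble roughly like this (a sketch; the __CUDA_ARCH__ test and cuda_fp16.h are the real mechanisms, the typedefs are merely illustrative):
// Preamble for CUDA kernels that use f16.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600
#include <cuda_fp16.h>   // the half type only becomes visible via this header
typedef __half f16;
#else
typedef float f16;       // older devices: float-based emulation, with 16-bit
                         // storage handled as in the sequential C backend
#endif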
I suspect the reason is that CUDA is mostly a “single source” programming model, where the same program contains both CPU and GPU code, and the CPU code is ultimately compiled with the system compiler - which does not support half-precision floats, and so needs compatibility shims. In contrast, OpenCL C is specifically for “device” code (read: GPU code), and the “host” (CPU) is completely out of its purview. Futhark does not use CUDA in a “single source” manner, but instead uses NVRTC to do run-time compilation of CUDA kernels, very similar to how OpenCL works.
The need to include cuda_fp16.h
is not just a compiler
implementation curiosity - it turns out to present the greatest
annoyance in this entire cavalcade of accidental complexity.
Apparently when you run-time compile a CUDA kernel with NVRTC, the
default search path does not include the CUDA include directory, so
this header cannot be found. Whoever calls the NVRTC API must
explicitly provide the directory containing this header. This is
bad. I’ve long sought to ensure Futhark didn’t contribute to the
growing accidental complexity
disaster, with new
configuration files and environment variables being my main aversion.
But now it seems we have no choice. Futhark must be able to figure
out, at run-time, where CUDA is installed, and there is no standard
for this. Sure, it’s often in /usr/local/cuda
, but it’s fairly
common to put it somewhere in /opt
as well, especially on
supercomputer systems. And there is no standard environment
variable pointing at the CUDA directory! Other GPU applications
respect any or all of CUDA_HOME, CUDA_PATH, and CUDA_ROOT. And,
to my great regret, now Futhark does as
well.
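Concretely, kernel compilation now has to do something like the following (a sketch of the approach, not the actual runtime code; error handling mostly omitted):
#include <stdio.h>
#include <stdlib.h>
#include <nvrtc.h>

// Guess where CUDA is installed, in the absence of any standard.
static const char *cuda_root(void) {
  const char *vars[] = { "CUDA_HOME", "CUDA_PATH", "CUDA_ROOT" };
  for (int i = 0; i < 3; i++) {
    const char *v = getenv(vars[i]);
    if (v != NULL) {
      return v;
    }
  }
  return "/usr/local/cuda";  // last-resort default
}

// Compile kernel source with NVRTC, adding the CUDA include directory
// to the search path so that '#include <cuda_fp16.h>' can be found.
static nvrtcResult compile_kernels(const char *src, nvrtcProgram *prog) {
  char include_opt[1024];
  snprintf(include_opt, sizeof include_opt,
           "--include-path=%s/include", cuda_root());
  const char *opts[] = { include_opt };
  nvrtcResult res = nvrtcCreateProgram(prog, src, "futhark.cu", 0, NULL, NULL);
  if (res != NVRTC_SUCCESS) {
    return res;
  }
  return nvrtcCompileProgram(*prog, 1, opts);
}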
This is a tragedy and a shame. There is no good reason that it has to be this way. I know that this will result in things not working down the line, just because a user didn’t set an invisible global variable correctly. I still hope that I just missed a subtle detail somewhere, and that this will eventually go away like a bad dream.
Gotta go fast?
Half-precision floats are the worst IEEE floats in all ways except
speed. So are they faster in Futhark? Sort of. On an A100 GPU, the
expression f16.sum
can sum one billion half-precision floats in
2.1ms, while f32.sum
takes 2.9ms to sum one billion
single-precision floats. That’s a speedup of 1.38x, which is much
less than the 2x we would theoretically expect. So what’s going
wrong? Two things:
1. Adding two half-precision floats on their own is probably not faster than adding two single-precision floats. CUDA’s half-precision API strongly implies one should consider using the half2 type, which packs two half-precision numbers in 32 bits, and allows “vectorised” operations on both in a single instruction. As far as I understand current hardware, this is the way half-precision floats can achieve high arithmetic throughput. I don’t know if the CUDA compiler will optimise multiple halfs into half2s on its own, but I doubt it - especially for a reduction. (See the CUDA sketch just after this list.)
2. The analysis above is valid, but it’s not actually what is happening here. This summation is completely bandwidth-bound, so the only thing that matters is how efficiently we access memory. In f16.sum, each thread reads a 16-bit value at a time. While the Futhark compiler arranges things so that the accesses are fully coalesced, the GPU’s memory architecture is not really designed for individual threads issuing such small reads - perhaps each thread will actually read (at least) 32 bits, then throw half of it away. In any case, the result is that the memory bus is not fully utilised.
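To make the first point concrete, here is roughly what half2-style summation looks like in hand-written CUDA - each thread loads 32 bits at a time and __hadd2 adds both lanes in a single instruction (a sketch for illustration, not code the Futhark compiler generates):
#include <cuda_fp16.h>

// Each thread sums its share of a half2 array with a grid-stride loop.
__global__ void sum_half2(const __half2 *xs, float *out, int n) {
  __half2 acc = __float2half2_rn(0.0f);
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += gridDim.x * blockDim.x) {
    acc = __hadd2(acc, xs[i]);   // adds both halves at once
  }
  // Collapse the two lanes and combine across threads.
  float partial = __half2float(__low2half(acc)) + __half2float(__high2half(acc));
  atomicAdd(out, partial);
}
(A real reduction would of course need a wider accumulator and a proper tree reduction; the point here is just the packed loads and the vectorised add.)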
A more efficient half-precision reduction would have each thread read
multiple packed values with a single transaction, such as by reading a
32-bit half2
. This is a trick we’ll have to teach the compiler at
some point. This is also not a new concern - if you write a program
that sums 8-bit or 16-bit integers, you are also not fully utilising
the memory bus, even though in absolute runtime it’s still faster than
summing the same number of 32-bit integers.
How do we verify that this analysis is correct? We can write a
program that interprets an array of u32
values as packed f16
s,
using a map
-reduce
composition:
map (\x -> f16.from_bits(u16.u32(x>>16)) +
           f16.from_bits(u16.u32(x))) >-> f16.sum
This can sum the equivalent of a billion f16
s (that is, half a
billion u32
s) in a mere 1.5ms - a 1.93x speedup over f32
summation.
So should you use half-precision floats?
Unless you know what you are doing, you probably should not use f16
in your code. Their numerical properties are absolutely going to bite
you. Interoperability is awkward and the performance isn’t even that
great yet. But from a language perspective they are now fully
supported, and hopefully we’ll make them run even faster in the not-too-distant future. But I doubt we’ll add more bits, so the numerics are
always going to be dubious.